# Distributed Optimization with Application to Power Systems and Control

by Alexander Engelmann

Karlsruher Institut für Technologie
Institut für Automation und angewandte Informatik

Dissertation approved by the KIT Department of Informatics of the Karlsruhe Institute of Technology (KIT) for obtaining the academic degree of Doktor-Ingenieur

by Alexander Engelmann, M.Sc.

Date of the oral examination: 21 October 2020
Referee: Prof. Dr.-Ing. Timm Faulwasser
Co-referees: Prof. Dr.-Ing. Veit Hagenmeyer, Prof. Colin Neil Jones, PhD

**Impressum**

Karlsruher Institut für Technologie (KIT)
KIT Scientific Publishing
Straße am Forum 2
D-76131 Karlsruhe

KIT Scientific Publishing is a registered trademark of Karlsruhe Institute of Technology. Reprint using the book cover is not allowed.

www.ksp.kit.edu

*This document – excluding parts marked otherwise, the cover, pictures and graphs – is licensed under a Creative Commons Attribution-Share Alike 4.0 International License (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/deed.en*

*The cover page is licensed under a Creative Commons Attribution-No Derivatives 4.0 International License (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/deed.en*

Print on Demand 2022 – Printed on FSC-certified paper

ISBN 978-3-7315-1180-9 DOI 10.5445/KSP/1000144792

# **Abstract**

In many engineering domains, systems are composed of partially independent subsystems—power systems are composed of distribution and transmission systems, teams of robots are composed of individual robots, and chemical process systems are composed of vessels, heat exchangers and reactors. Often, these subsystems should reach a common goal such as satisfying a power demand with minimum cost, flying in a formation, or reaching an optimal set-point. At the same time, limited information exchange is desirable—for confidentiality reasons but also due to communication constraints. Moreover, a fast and reliable decision process is key as applications might be safety-critical.

Mathematical optimization techniques are among the most successful tools for controlling systems optimally with feasibility guarantees. Yet, they are often centralized—all data has to be collected in one central and computationally powerful entity. Methods from distributed optimization control the subsystems in a distributed or decentralized fashion, reducing or avoiding central coordination. These methods have a long and successful history. Classical distributed optimization algorithms, however, are typically designed for convex problems. Hence, they are only partially applicable in the above domains since many of them lead to optimization problems with non-convex constraints.

This thesis develops one of the first frameworks for distributed and decentralized optimization with non-convex constraints. Based on the Augmented Lagrangian Alternating Direction Inexact Newton (ALADIN) algorithm, a bi-level distributed ALADIN framework is presented, solving the coordination step of ALADIN in a decentralized fashion. This framework can handle various decentralized inner algorithms, two of which we develop here: a decentralized variant of the Alternating Direction Method of Multipliers (ADMM) and a novel decentralized conjugate gradient algorithm. Decentralized conjugate gradient is, to the best of our knowledge, the first decentralized algorithm with a guarantee of convergence to the exact solution in a finite number of iterations. Sufficient conditions for fast local convergence of bi-level ALADIN are derived. Bi-level ALADIN strongly reduces the communication and coordination effort of ALADIN and preserves its fast convergence guarantees. We illustrate these properties on challenging problems from power systems and control, and compare performance to the widely used ADMM.

The developed methods are implemented in the open-source MATLAB toolbox ALADIN-α, one of the first toolboxes for decentralized non-convex optimization. ALADIN-α comes with a rich set of application examples from different domains showing its broad applicability. As an additional contribution, this thesis provides new insights into why state-of-the-art distributed algorithms might encounter issues for constrained problems.

# **Deutsche Kurzfassung**

In vielen Anwendungen bestehen Systeme aus einer Vielzahl von Subsystemen. Elektrische Energiesysteme bestehen aus Transportsystemen und Verteilsystemen, Roboterteams bestehen aus einzelnen Robotern und chemische Prozesssysteme bestehen aus einzelnen Kesseln, Wärmetauschern und Reaktoren. Oftmals arbeiten diese Subsysteme auf ein gemeinsames Ziel hin, wie zum Beispiel die möglichst kostengünstige Deckung eines Energiebedarfs, das Fliegen in einer bestimmten Formation oder die kostenoptimale Ansteuerung eines Arbeitspunktes. Dabei ist ein möglichst geringer Informationsaustausch wünschenswert—sei es aus Vertraulichkeitsgründen oder auch aus Gründen limitierter Datenübertragungskapazität. Zusätzlich ist ein schneller und zuverlässiger Entscheidungsprozess essentiell—besonders im Falle sicherheitskritischer Anwendungen.

Methoden der mathematischen Optimierung gehören zu den erfolgreichsten Werkzeugen für die optimale Steuerung von Systemen mit Garantien. Dennoch sind diese Verfahren oftmals zentralisiert—alle Daten werden in einer zentralen, koordinierenden Einheit mit hoher Rechenleistung gesammelt. Methoden der verteilten Optimierung steuern Subsysteme auf verteilte oder dezentrale Weise unter Reduktion oder Vermeidung zentraler Koordination. Diese Methoden haben eine lange und erfolgreiche Historie. Klassische verteilte Optimierungsverfahren wurden in den meisten Fällen für konvexe Probleme entwickelt. Damit sind sie jedoch nur teilweise auf Probleme in den oben erwähnten Domänen anwendbar, da viele von ihnen auf Optimierungsprobleme mit nichtkonvexen Nebenbedingungen führen.

Die vorliegende Arbeit entwickelt eine der ersten Verfahrensklassen für die verteilte und dezentrale Optimierung unter nichtkonvexen Nebenbedingungen. Basierend auf dem Augmented Lagrangian Alternating Direction Inexact Newton (ALADIN) Algorithmus wird ein zweistufiges ALADIN Verfahren präsentiert, welches den Koordinationsschritt von ALADIN dezentral löst. Diese Verfahrensklasse ist in der Lage, mit verschiedenen inneren dezentralen Verfahren umzugehen, wobei zwei solcher Verfahren vorgestellt werden: eine dezentrale Variante des Alternating Direction Method of Multipliers (ADMM) Algorithmus und eine dezentrale Variante des konjugierten Gradientenverfahrens. Das dezentrale konjugierte Gradientenverfahren ist nach unserem Kenntnisstand das erste dezentrale Verfahren mit einer Konvergenzgarantie zur exakten Lösung eines Koordinationsproblems in einer endlichen Anzahl von Schritten. Hinreichende Bedingungen für die schnelle lokale Konvergenz des zweistufigen ALADIN Verfahrens werden hergeleitet. Zusätzlich zu seinen schnellen Konvergenzeigenschaften reduziert das zweistufige Verfahren den Kommunikations- und Koordinationsbedarf von ALADIN. Wir verdeutlichen diese Eigenschaften anhand herausfordernder Anwendungsbeispiele, die in elektrischen Energiesystemen sowie in der Optimalsteuerung auftreten, und vergleichen die Leistungsfähigkeit mit dem weitverbreiteten ADMM Verfahren.

Die entwickelten Verfahren sind in der quelloffenen MATLAB Toolbox ALADIN-α implementiert—einer der ersten Toolboxen für die dezentrale nichtkonvexe Optimierung. ALADIN-α verfügt über vielfältige Anwendungsbeispiele aus verschiedenen Domänen, was ihre breite Anwendbarkeit unterstreicht. Als zusätzlichen Beitrag dokumentiert diese Arbeit neue Einsichten über die Gründe, warum aktuelle verteilte Optimierungsverfahren in manchen Fällen Schwierigkeiten beim Lösen von Optimierungsproblemen unter Nebenbedingungen haben.

# **Acknowledgment**

I would like to thank the many individuals, who supported me in the last three years of my PhD journey. Without your continuous support, this thesis would certainly not have been possible.

First of all, I would like to thank my supervisor Prof. Dr. Timm Faulwasser for his continuous excellent supervision and support, the many opportunities for international scientific exchange and also for his trust and the freedom in being able to pursue side projects, many of which worked out very well. Furthermore, I would like to thank Prof. Dr. Veit Hagenmeyer, who supported me especially in the last months of my PhD time. I would also like to thank Prof. Colin Jones for serving as an excellent examiner of my thesis and also Prof. Snelting, Prof. Beckert and Prof. Abeck for participating in my committee.

Furthermore, I would like to thank all my colleagues from the Institute for Automation and Applied Informatics. Riccardo, Till and Alex, you were and are not only colleagues to me, but also close friends. With you, Riccardo, it was always fun to discuss many things far beyond research (but also related to it). We were the "power engineers" in our group, which led to many interesting and fruitful discussions with the "math and control engineers". With Till, my office mate, I remember countless exciting discussions about mathematical subjects far beyond the scope of my thesis. I really enjoyed this time and would also like to thank you for your support in difficult situations. Alex, with you too I enjoyed many discussions about research, politics and many other subjects. Furthermore, I would like to thank many other colleagues from IAI for the good relationship in the past years—both professionally and privately (in alphabetic order): Benedikt, Bernadette, Claudia, Clemens, Dominique, Frederik, Heiko, Henrieke, Jürgen, Lutz, Philipp, Sabine and Shadi. It was always fun to work with you and I really enjoyed the positive atmosphere in the institute—keep it up! Supervising students was a highlight for me: thank you Xinliang, Ruchuan and Sina for our time spent together!

Furthermore, I would like to thank my scientific partners—especially the group of Prof. Boris Houska from ShanghaiTech. I really enjoyed the time we spent together—especially at my visit in Shanghai. Thank you Boris for your continuous excellent support, feedback and encouragement. Yuning, we started our PhD essentially at the same time and had a close collaboration ever since. I enjoyed your creativity and thank you for the many projects we successfully finished together. I would also like to thank you, Xu, for exploring many interesting new research directions. Furthermore, I would like to thank my partners from the SmiLES project, thank you (in alphabetic order) Benedikt, Edmund, Isabelle, Oliver, Tue, Thomas, and Zhichao for all the good times spent together in various workshops and after-workshop events.

The next few sentences, dedicated to the most important people in my life, are written in German. Liebe Mama, lieber Papa, ich möchte euch für eure Unterstützung danken, ich kann immer auf euch zählen, das ist unendlich wertvoll für mich. Juliane, Simon, Oma, Opa, ihr seid sehr wichtig in meinem Leben, ihr erdet mich und steht mir immer zur Seite. Auch viele meiner Freunde haben mich bei vielen Schwierigkeiten auf meiner Reise sehr unterstützt—hier möchte ich mich vor allen Dingen bei dir, Nico, bedanken. Du bist der Beste, du warst immer da, wenn es die ein oder andere konfliktreiche Situation zu umschiffen galt, und hast mir eine sehr wichtige äußere Perspektive auf die Dinge gegeben. Ich konnte immer auf dich zählen. Danke dafür!

Oktober 2020 Alexander Engelmann

# **Acronyms**


# **Frequently Used Symbols**

## **Distributed optimization**


## **Power systems**


# **Content**




# **1 Introduction**

In many engineering domains, one can observe a trend towards systems composed of interconnected subsystems coordinating towards a common goal. Examples range from power systems, where grids collaborate to satisfy power demands, through robotics, where robots collaborate to fulfill a certain task, to chemical engineering, where reactors collaborate to produce a certain product in an optimal fashion. Because of confidentiality reasons and communication constraints, this collaboration should be achieved subject to limited information exchange and limited central coordination while being fast and reliable.

One particular example of such systems are electricity grids. Electricity grids are typically composed of smaller grids, each of which is operated by one system operator. Together, they aim at a cheap and safe energy supply. To achieve this goal, it is usually necessary to exchange grid models and consumption data. Exchanging this data is, however, often problematic for confidentiality reasons. Moreover, avoiding a single point of failure (e.g. a central coordinator) is also important for security reasons.

This thesis is about operating such systems by means of mathematical optimization techniques.

### **Employing mathematical optimization**

Mathematical optimization is one of the most successful techniques for controlling systems automatically and optimally towards a goal. Mathematical optimization problems are typically formulated as

$$\min\_{\mathbf{x}} f(\mathbf{x}) \quad \text{subject to} \quad \mathbf{x} \in \mathcal{X}, \tag{1.1}$$

where $f : \mathcal{X} \to \mathbb{R}$ is called the *objective function*, $\mathcal{X}$ is called the *constraint set* or *feasible set* and $x$ is called the *decision vector*. The objective function

Figure 1.1: Centralized, distributed and decentralized optimization.

$f$ encodes the goal (such as minimizing the cost of power generation in an electricity grid) and the constraint set $\mathcal{X}$ captures the underlying physical model (such as an electricity grid model) and engineering limitations (such as maximum generator outputs). Problems in the form of (1.1) are typically solved by *centralized* optimization algorithms in the sense that the data of all subsystems is collected in one central entity. This entity solves the problem and broadcasts the solution to the subsystems (Figure 1.1a).
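To make the centralized setting concrete, the following sketch solves a tiny, made-up instance of (1.1): a quadratic objective over the unit ball, handled by projected gradient descent, a basic centralized first-order method. All data and the choice of method are illustrative assumptions, not taken from this thesis.

```python
# Hypothetical instance of (1.1): minimize a smooth quadratic over the
# feasible set X = {x : ||x||_2 <= 1} with projected gradient descent.
import numpy as np

f = lambda x: (x[0] - 2.0) ** 2 + (x[1] + 1.0) ** 2       # objective f
grad = lambda x: 2.0 * np.array([x[0] - 2.0, x[1] + 1.0])  # its gradient

def project(x):
    """Euclidean projection onto X = {x : ||x||_2 <= 1}."""
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

x = np.zeros(2)
for _ in range(200):               # fixed step 0.25 < 1/L with L = 2
    x = project(x - 0.25 * grad(x))

print(x)  # approaches the projection of (2, -1) onto the unit ball
```

The iterate converges to $(2,-1)/\sqrt{5}$, the point of the unit ball closest to the unconstrained minimizer.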

Centralized optimization is often not desirable because of information aggregation in a single entity, which implies the existence of a single point of failure. Hence, techniques from *distributed* optimization are important. Distributed optimization means to shift computation mainly to the subsystems, with a coordinating entity (Figure 1.1b).<sup>1</sup> Distributed optimization algorithms typically require problems where $f$ and $\mathcal{X}$ have a *special structure*. One common structure is

$$\min\_{\mathbf{x}} \sum\_{i \in \mathcal{R}} f\_i(\mathbf{x}\_i) \quad \text{subject to} \quad \mathbf{x}\_i \in \mathcal{X}\_i, \quad i \in \mathcal{R} \quad \text{and} \quad \sum\_{i \in \mathcal{R}} A\_i \mathbf{x}\_i = b. \tag{1.2}$$

<sup>1</sup> Note that, although the terminology of "distributed algorithms" might be used slightly differently in computer science, there is an intimate connection between the two fields. We briefly comment on this interconnection in Appendix D.

Here, the optimization problem is distributed among a set of subsystems $\mathcal{R} = \{1, \ldots, |\mathcal{R}|\}$, where each subsystem $i$ has its own objective function $f_i$ (for example encoding the generator cost of one system operator), its own decision vector $x_i$ (encoding voltages and powers in the region controlled by the system operator) and its own constraint set $\mathcal{X}_i$ (containing the grid model and engineering constraints of one region). Coupling between the subsystems is considered via a so-called consensus constraint $\sum_{i \in \mathcal{R}} A_i x_i = b$. This consensus constraint encodes the need for compatible physical values at the interconnection points between two neighboring subsystems (such as a matching power import/export between neighboring system operators).
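A minimal numeric sketch of the structure (1.2), with hypothetical data: two subsystems with quadratic local costs $f_i(x_i) = \tfrac{1}{2}\|x_i - c_i\|^2$ and a single consensus constraint tying one component of each decision vector together (a stand-in for a matching power import/export). For reference, the centralized solution is obtained from the KKT system of this equality-constrained QP.

```python
# Toy instance of (1.2): two subsystems, one consensus constraint
# A1 x1 + A2 x2 = b enforcing x1[1] = x2[0]. All data are made up.
import numpy as np

c1, c2 = np.array([1.0, 3.0]), np.array([5.0, 2.0])  # local cost references
A1, A2 = np.array([[0.0, 1.0]]), np.array([[-1.0, 0.0]])
b = np.array([0.0])                                  # consensus: x1[1] = x2[0]

# Centralized reference solution via the KKT system of the QP:
# [ I   0   A1^T ] [x1]   [c1]
# [ 0   I   A2^T ] [x2] = [c2]
# [ A1  A2   0   ] [y ]   [b ]
K = np.block([
    [np.eye(2), np.zeros((2, 2)), A1.T],
    [np.zeros((2, 2)), np.eye(2), A2.T],
    [A1, A2, np.zeros((1, 1))],
])
sol = np.linalg.solve(K, np.concatenate([c1, c2, b]))
x1, x2, lam = sol[:2], sol[2:4], sol[4:]
print(x1, x2)  # the coupled components agree: x1[1] == x2[0]
```

The unconstrained local optima would be $x_1 = c_1$, $x_2 = c_2$; the consensus constraint forces the coupled components to agree, here at their average.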

While distributed optimization reduces the amount of central coordination, it still requires a coordinator. The appealing promise of *decentralized* optimization algorithms is to overcome the need for this coordinator, i.e. they avoid central coordination entirely and exchange information directly between neighboring subsystems (Figure 1.1c).

Although distributed and especially decentralized optimization algorithms are promising candidates for controlling systems with limited information exchange, classical distributed and decentralized algorithms are typically designed for *convex* problems. In many practical applications, however, models are non-linear, leading to *non-convex* constraint sets. Hence, classical distributed and decentralized algorithms are often not applicable—at least not with convergence guarantees. Moreover, classical algorithms are often slow in practice.

In view of the above, the main research question of this thesis is as follows:

*How can one design efficient decentralized optimization algorithms for problems with non-convex constraints that operate under limited information exchange and guarantee fast convergence?*

## **Outline and contributions**

Next, we outline the content of this thesis and highlight contributions of each chapter.

## **Chapter 2—Basics of Distributed Optimization**

Chapter 2 briefly recalls basics of distributed optimization. We start with the fundamentals of nonlinear programming, such as optimality conditions and prominent nonlinear programming algorithms. We continue by recalling two popular classical distributed optimization algorithms, which serve as building blocks for the algorithms we derive in Chapter 4 and as benchmarks for the numerical tests in Chapter 5. We then recall the basic form of the Augmented Lagrangian Alternating Direction Inexact Newton (ALADIN) algorithm, which forms the foundation of the bi-level ALADIN algorithm developed in Chapter 4. Chapter 2 concludes with new insights into why classical distributed algorithms might exhibit severe issues for constrained problems. We show that ALADIN is able to overcome these limitations by its more advanced coordination step, which explicitly considers constraint and curvature information.

## **Chapter 3—A Survey on Distributed Optimization**

In Chapter 3, we provide a literature review on distributed and decentralized optimization algorithms from different fields. The review is new in this form, since most literature reviews consider distributed algorithms from one particular community only. We conclude Chapter 3 by showing that most of the approaches so far have difficulties especially for problems with non-convex constraints.

## **Chapter 4—Bi-level Distributed ALADIN**

Chapter 4 presents the main conceptual contribution of this thesis: a bi-level distributed ALADIN framework combining basic ALADIN with condensing techniques from nonlinear programming and an inner decentralization layer.<sup>2</sup> By using condensing techniques, we significantly lower the coordination and communication requirements of ALADIN. Decentralization is achieved by solving the coordination step by means of decentralized inner algorithms. To this end, we propose a decentralized variant of ADMM and, as an alternative, a novel decentralized variant of the conjugate gradient algorithm for solving the coordination step of ALADIN. Decentralized conjugate gradient has the advantage of converging in a finite number of iterations, while ADMM achieves at most linear convergence. We derive both algorithms based on a new sparsity encoding technique. As these two inner algorithms are both iterative in nature, they introduce inexactness to the coordination step of bi-level ALADIN. We show that bi-level ALADIN nonetheless preserves the fast convergence guarantees of basic ALADIN if the inaccuracy in the coordination step decays fast enough.

<sup>2</sup> For the sake of readability, we will simply write bi-level ALADIN in the following. Note that we do not make any connection to bi-level optimization problems in the sense of nested optimization problems [CMS07].

In summary, the contributions of Chapter 4 are


The results of this chapter appeared in [Eng+20]. The sparsity encoding technique has not been published so far.

## **Chapter 5—Application to Power Systems**

Chapter 5 applies ALADIN and bi-level ALADIN to Optimal Power Flow (OPF) problems—an important problem class from power systems. We start by reviewing relevant literature on distributed OPF and conclude that state-of-the-art algorithms such as ADMM often lack convergence guarantees and often exhibit slow convergence. As an alternative, we propose ALADIN variants and compare ALADIN, condensed ALADIN and bi-level ALADIN to ADMM as a state-of-the-art algorithm for distributed OPF. We provide a numerical comparison in terms of convergence, coordination and communication for practically relevant cases. Moreover, we show that ADMM leads to feasible but not necessarily optimal solutions in case of high penalization parameters in combination with a feasible initial guess. This is particularly important in the context of OPF, as the combination of these two assumptions is sometimes used in the distributed OPF literature.

In summary, the contributions of Chapter 5 are


The results of this chapter have been published in [Eng+19b; Eng+17; Eng+20; EF18], where [Eng+19b] mainly focuses on numerical tests on cases up to 300 buses. [EF18] analyzes the convergence of ADMM in case of high penalization parameters in combination with a feasible initial point.

## **Chapter 6—The ALADIN- toolbox**

Chapter 6 presents an open-source toolbox, ALADIN-α, implementing ALADIN, bi-level ALADIN and the Alternating Direction Method of Multipliers (ADMM) in a unified framework. This toolbox is designed to be modular, with rapid prototyping of distributed and decentralized optimization algorithms in mind. ALADIN-α supports advanced features such as parallel computing and parametric programming. The toolbox comes with a rich set of examples from different engineering fields highlighting its broad applicability. Specifically, we illustrate possible applications of ALADIN-α beyond power systems on a numerical example from distributed optimal control. ALADIN-α is one of the first toolboxes for decentralized non-convex optimization.

More explicitly, the contributions of ALADIN-α can be summarized as


The results of this chapter have been submitted for publication in [Eng+20].

Chapter 7 summarizes this thesis and proposes promising directions for future work.

# **2 Basics of Distributed Optimization**

This chapter briefly reviews basics of *nonlinear programming*, where $f$ and $\mathcal{X}$ in problem (1.1) are described by continuously differentiable functions. We also describe widely used distributed optimization algorithms, where the derivation of dual decomposition is based on [Boy+11] and [BT89]. The derivation of ADMM comes from [Hou+17], with the difference that here we explicitly consider equality constraints. The optimality conditions are mainly from [NW06] and the illustrating examples in Section 2.4 are novel.

## **2.1 Basics of nonlinear programming**

*Nonlinear programs (NLPs)* are special cases of (1.1) in form of

$$\min_{x \in \mathbb{R}^{n_x}} f(x) \tag{2.1a}$$

$$\text{subject to} \quad x \in \mathcal{X} = \{ x \in \mathbb{R}^{n_x} \mid g(x) = 0,\ h(x) \le 0 \} \tag{2.1b}$$

where $g : \mathbb{R}^{n_x} \to \mathbb{R}^{n_g}$ and $h : \mathbb{R}^{n_x} \to \mathbb{R}^{n_h}$ are called *equality* and *inequality* constraints and $x \in \mathbb{R}^{n_x}$ is called the *decision vector* of (2.1).<sup>1</sup> Here, $f$, $g$ and $h$ are assumed to be sufficiently often continuously differentiable. Let us endow the constraint set $\mathcal{X}$ with a norm $\|\cdot\|$ yielding a normed space $(\mathcal{X}, \|\cdot\|)$. Then we can define minimizers to problem (2.1) as follows.

*Definition 1 (Minimizers)* We call $x^\star \in \mathcal{X}$ a *local minimizer* to problem (1.1), if $f(x^\star) \le f(x)$ for all $x \in \mathcal{B}_\epsilon(x^\star) \cap \mathcal{X}$ with $\mathcal{B}_\epsilon(\bar{x}) = \{ x \mid \|x - \bar{x}\| < \epsilon \}$, $\epsilon > 0$. If $f(x^\star) \le f(x)$ for *all* $x \in \mathcal{X}$, we call $x^\star$ a *global minimizer* to problem (1.1).

<sup>1</sup> Other special cases are *integer* or *mixed-integer* problems for example, where $\mathcal{X} \subseteq \mathbb{R}^{n_x} \times \mathbb{Z}^{n_z}$; or *infinite-dimensional* problems, where $\mathcal{X}$ is a subset of a function space. We refer to [Grö18] for an overview of many important classes of practical relevance.

If the above inequalities hold strictly, we call $x^\star$ a *strict* local minimizer respectively a *strict* global minimizer. We call $\mathcal{A}(\bar{x}) := \{ i \in \{1, \ldots, n_h\} \mid h_i(\bar{x}) = 0 \}$ the set of *active* or *binding* inequality constraints at a point $\bar{x}$. We call problem (2.1) *feasible*, if $\mathcal{X}$ is non-empty.

## **2.1.1 Optimality conditions**

How can one find minimizers to (2.1)? One approach is to derive *optimality conditions* which (when solved) provide solution candidates to (2.1). Many numerical algorithms rely on the first-order necessary Karush-Kuhn-Tucker (KKT) conditions, which are a set of non-linear equations and non-linear inequalities providing candidate solution points to (2.1). To ensure that the KKT conditions hold true at a minimizer, one relies on constraint qualifications. Here we recall the strongest constraint qualification—the *linear independence constraint qualification* (LICQ).<sup>2</sup>

*Definition 2 (Linear independence constraint qualification)* Let $\bar{x}$ be a feasible point to (2.1). LICQ is said to hold at $\bar{x}$, if the gradients $\nabla h_i(\bar{x})$, $i \in \mathcal{A}(\bar{x})$, and $\nabla g_j(\bar{x})$, $j \in \{1, \ldots, n_g\}$, are linearly independent.
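As a small numeric aside (with hypothetical data), Definition 2 can be checked mechanically: collect the gradients of the equality constraints and of the *active* inequality constraints at a feasible point, and test their linear independence via the rank of the stacked gradient matrix.

```python
# LICQ check at a feasible point of a toy problem (all data made up):
#   g(x) = x1 + x2 - 1 = 0,   h(x) = (-x1, -x2) <= 0  (i.e. x >= 0).
import numpy as np

g_jac = lambda x: np.array([[1.0, 1.0]])              # gradient of g (one row)
h = lambda x: np.array([-x[0], -x[1]])                # inequality constraints
h_jac = lambda x: np.array([[-1.0, 0.0], [0.0, -1.0]])

x_bar = np.array([1.0, 0.0])                          # feasible point
active = np.abs(h(x_bar)) < 1e-9                      # active set A(x_bar)
G = np.vstack([g_jac(x_bar), h_jac(x_bar)[active]])   # stacked gradients

licq = np.linalg.matrix_rank(G) == G.shape[0]         # rows linearly independent?
print("active inequalities:", np.flatnonzero(active), "LICQ:", licq)
```

At $\bar{x} = (1, 0)$ only the second inequality is active; the two stacked gradients are linearly independent, so LICQ holds there.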

Now, we are ready to state the KKT conditions.

<sup>2</sup> Weaker constraint qualifications exist, for example the Mangasarian-Fromovitz Constraint Qualification (MFCQ) or the Abadie Constraint Qualification. Moreover, there are first-order optimality conditions which do not require constraint qualifications, for example the Fritz-John optimality conditions [GGT04; BSS13]. However, these conditions are more difficult to evaluate and thus we stick with the KKT conditions and LICQ which is a common approach in the literature.

*Theorem 1 (First-order optimality KKT conditions)* Let $f$, $g$ and $h$ be continuously differentiable and let $x^\star$ be a minimizer to (2.1) for which LICQ holds. Then there exist unique vectors $\gamma^\star \in \mathbb{R}^{n_g}$ and $\mu^\star \in \mathbb{R}^{n_h}$ such that

$$0 = \nabla f(x^{\star}) + \sum_{i \in \{1, \ldots, n_g\}} \gamma_i^{\star} \nabla g_i(x^{\star}) + \sum_{j \in \{1, \ldots, n_h\}} \mu_j^{\star} \nabla h_j(x^{\star}), \tag{2.2a}$$

$$0 = \mathbf{g}\left(\mathbf{x}^{\star}\right),\tag{2.2b}$$

$$0 \ge h(x^{\star}),\tag{2.2c}$$

$$
\mu^\star \ge 0,\tag{2.2d}
$$

$$0 = \mu_i^\star h_i(x^\star) \quad \text{for all} \quad i \in \{1, \ldots, n_h\}. \tag{2.2e}$$

We call $(x^\star, \gamma^\star, \mu^\star)$ a *KKT point*, if it satisfies the KKT conditions (2.2).

Next, we recall sufficient conditions for optimality. Consider the *Lagrangian* function to problem (2.1)

$$\mathcal{L}(\boldsymbol{x}, \boldsymbol{\gamma}, \boldsymbol{\mu}) := \boldsymbol{f}(\boldsymbol{x}) + \boldsymbol{\gamma}^{\top} \boldsymbol{g}(\boldsymbol{x}) + \boldsymbol{\mu}^{\top} \boldsymbol{h}(\boldsymbol{x}),$$

where $\gamma^\top = (\gamma_1, \ldots, \gamma_{n_g})$ and $\mu^\top = (\mu_1, \ldots, \mu_{n_h})$.

*Theorem 2 (Second-order sufficient condition [NW06, Thm 12.6])* Let $x^\star$ be a KKT point where LICQ holds. Moreover, let $\nabla^2_{xx} \mathcal{L}(x^\star, \gamma^\star, \mu^\star)$ be positive definite on the subspace $\{ d \in \mathbb{R}^{n_x} \mid \nabla g(x^\star)\, d = 0 \}$. Then, $x^\star$ is a strict local minimizer to problem (2.1).<sup>3</sup>

Here, we denote the Jacobian matrix of $g$ as $\nabla g(x) = \left( \nabla g_1(x), \ldots, \nabla g_{n_g}(x) \right)^\top$. Many numerical algorithms rely on solving (2.2) for obtaining solution candidates to (2.1). The second-order sufficient conditions can then be used to distinguish minimizers from maximizers and saddle points. This forms the basis for two of the most popular algorithms for nonlinear programming: Sequential Quadratic Programming (SQP) and Interior Point (IP) methods.

<sup>3</sup> Note that the subspace assumption $\{ d \in \mathbb{R}^{n_x} \mid \nabla g(x^\star)\, d = 0 \}$ is slightly stronger than the one in the literature, where this subspace can be further reduced by means of the inequality constraint components $h_i$ which are active at $x^\star$. However, this would make the presentation much more technical and would not give much additional generality for our purposes. For a more general version we refer to [NW06, Thm 12.6].

Observe that in the special case of equality constraints only, solving (2.2) reduces to solving a nonlinear system of equations which can be solved via standard Newton methods.
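The observation above can be made concrete on a small made-up problem: for equality constraints only, the KKT conditions (2.2) form a square nonlinear system in $(x, \gamma)$, and a plain full-step Newton iteration on this system converges to a KKT point. The problem data below are purely illustrative.

```python
# Toy equality-constrained problem:
#   min  x1^2 + x2^2   subject to  g(x) = x1^2 + x2 - 1 = 0.
# Newton's method on the KKT system F(x, gamma) = 0.
import numpy as np

def F(z):
    """KKT residual: stationarity of the Lagrangian plus feasibility."""
    x1, x2, gam = z
    return np.array([
        2 * x1 + gam * 2 * x1,   # dL/dx1 = df/dx1 + gamma * dg/dx1
        2 * x2 + gam,            # dL/dx2 = df/dx2 + gamma * dg/dx2
        x1**2 + x2 - 1,          # g(x) = 0
    ])

def J(z):
    """Jacobian of the KKT residual (the KKT matrix)."""
    x1, x2, gam = z
    return np.array([
        [2 + 2 * gam, 0.0, 2 * x1],
        [0.0, 2.0, 1.0],
        [2 * x1, 1.0, 0.0],
    ])

z = np.array([1.0, 0.5, -0.5])            # initial guess (x1, x2, gamma)
for _ in range(20):                        # standard full-step Newton
    z = z - np.linalg.solve(J(z), F(z))

print(z)  # a KKT point near (1/sqrt(2), 1/2, -1)
```

Analytically, the minimizers are $(\pm 1/\sqrt{2},\, 1/2)$ with $\gamma^\star = -1$; from the chosen start, Newton converges to the positive one.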

### **Convex problems**

An important class of problems are *convex optimization problems*. Hence, let us recall basics of convex optimization.

*Definition 3 (Convex set)* A set $\mathcal{X} \subseteq \mathbb{R}^{n_x}$ is called convex if the line segment between any two points $x_1, x_2 \in \mathcal{X}$ lies entirely in $\mathcal{X}$, i.e. $\theta x_1 + (1 - \theta) x_2 \in \mathcal{X}$ for all $\theta \in (0, 1)$.

If the above inclusion holds strictly (i.e. $\theta x_1 + (1 - \theta) x_2$ lies in the interior of $\mathcal{X}$) for all $x_1, x_2 \in \mathcal{X}$, $x_1 \ne x_2$, $\mathcal{X}$ is called *strictly convex*.

*Definition 4 (Convex function)* A function $f : \mathcal{X} \to \mathbb{R}$ is called convex if its epigraph is convex, i.e. if for all $x_1, x_2 \in \mathcal{X}$ we have $f(\theta x_1 + (1 - \theta) x_2) \le \theta f(x_1) + (1 - \theta) f(x_2)$ for all $\theta \in (0, 1)$.

If the above inequality holds strictly for all $x_1 \ne x_2$, $f$ is called *strictly convex*.

*Definition 5 (Strongly convex function)* If $f$ is differentiable and $\left( \nabla f(x_1) - \nabla f(x_2) \right)^\top (x_1 - x_2) \ge \sigma \|x_1 - x_2\|_2^2$ holds for all $x_1, x_2 \in \mathcal{X}$, $f$ is called strongly convex with parameter $\sigma$.

We denote the set of convex functions by $\mathcal{S}$ and the set of strongly convex functions with parameter $\sigma$ by $\mathcal{S}_\sigma$. Note that any strongly convex function is strictly convex but not vice versa.
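Definition 5 can be illustrated numerically (with made-up data): for a quadratic $f(x) = \tfrac{1}{2} x^\top Q x$ with $Q$ positive definite, $f$ is strongly convex with parameter $\sigma = \lambda_{\min}(Q)$, since $(\nabla f(x_1) - \nabla f(x_2))^\top (x_1 - x_2) = (x_1 - x_2)^\top Q (x_1 - x_2) \ge \lambda_{\min}(Q) \|x_1 - x_2\|_2^2$.

```python
# Spot-check of the strong convexity inequality for f(x) = 0.5*x'Qx.
import numpy as np

rng = np.random.default_rng(0)
Q = np.array([[3.0, 1.0], [1.0, 2.0]])    # symmetric positive definite (made up)
sigma = np.linalg.eigvalsh(Q).min()        # strong convexity parameter

grad = lambda x: Q @ x                     # gradient of f
for _ in range(1000):                      # random pairs of points
    x1, x2 = rng.normal(size=2), rng.normal(size=2)
    lhs = (grad(x1) - grad(x2)) @ (x1 - x2)
    assert lhs >= sigma * np.linalg.norm(x1 - x2) ** 2 - 1e-9

print("strong convexity parameter:", sigma)
```

Equality is attained when $x_1 - x_2$ is aligned with the eigenvector of $\lambda_{\min}(Q)$, so the parameter $\sigma$ is tight.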

*Convex optimization problems* are special cases of (1.1), where $f$ and $\mathcal{X}$ are convex. In case of nonlinear programming problems (2.1), this means that the inequality constraints $h$ are convex and the equality constraints $g$ are affine [Boy+04]. A nice property of convex problems is that their minimum value $f(x^\star)$ is unique (if it exists). Moreover, if $f$ is strongly convex, also the minimizer $x^\star$ is unique. In addition, the KKT conditions (2.2) are necessary *and* sufficient for convex problems, and thus one can omit the evaluation of the conditions of Theorem 2. Furthermore, specialized algorithms for convex optimization problems exist, some of which solve these problems in polynomial time. Although we focus on non-convex optimization in this work, we will use these results from convex optimization in some of our derivations.

Next, we recall some basics from convex optimization which are essential to our later developments. In convex optimization, the *dual problem* (which is a maximization problem) to a (primal) minimization problem is often useful, since one can recover the solution to the primal problem from the solution to the dual problem under certain conditions. In many cases, the dual problem is considerably easier to solve than the primal problem—thus many algorithms focus on solving the dual problem. Recovering the solution to the primal problem from the dual is possible particularly in absence of a *duality gap*, for which we recall conditions next.

Consider the partial Lagrangian of (2.1) with respect to $g$, $\mathcal{L}(x, \gamma) = f(x) + \gamma^\top g(x)$. Then, the *dual function* to (2.1) is defined as

$$q(\gamma) := \inf_{x \in \mathcal{X}} \mathcal{L}(x, \gamma) = \inf_{x \in \mathcal{X}} f(x) + \gamma^{\top} g(x) \tag{2.3}$$

with $\mathcal{X} := \{ x \in \mathbb{R}^{n_x} \mid h(x) \le 0 \}$. Moreover, the dual problem to (2.1) is

$$\sup\_{\gamma} q(\gamma). \tag{2.4}$$

This leads to the following theorem.

*Theorem 3 (Strong duality under Slater's condition [Ber99, Props 5.1.4, 5.1.5, 5.3.1])* Let $f$ be convex on $\mathcal{X}$, let $g$ be affine and $h$ be convex, and let (2.1) be feasible. Moreover, let some point $\bar{x}$ with $h(\bar{x}) < 0$ exist (Slater's condition). Then,

$$\sup_{\gamma} q(\gamma) = \inf_{g(x) = 0,\ h(x) \le 0} f(x),$$

i.e. there is no duality gap between (2.1) and (2.4). Moreover, the set of maximizers to (2.4), $\mathcal{D}$, coincides with the set of dual solutions to (2.1), and the set of primal solutions is given by the minimizers to

$$\min\_{\boldsymbol{x}\in\mathcal{X}}\mathcal{L}(\boldsymbol{x},\boldsymbol{\gamma}^{\star}),\quad\boldsymbol{\gamma}^{\star}\in\mathcal{D}.\tag{2.5}$$


Roughly speaking, this means that if the problem at hand is convex and Slater's condition holds, one can "replace" the original problem with its dual. After solving the dual problem, a primal solution $x^\star$ can be recovered by means of (2.5).

*Remark 1 (Slater's condition)* Note that Slater's condition is standard in convex optimization and not overly restrictive. Moreover, for special problem classes the assumption can be relaxed: if $h$ is affine, for example, the strict feasibility condition can be relaxed to feasibility [Ber99, Prop 5.2.1]. Thus, for feasible convex quadratic programs, for example, Slater's condition is always satisfied.
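To make the primal-recovery mechanism of Theorem 3 concrete, consider the following minimal numerical sketch (the toy problem and the grid search are assumed for illustration, not taken from this thesis). For $\min_x x^2$ subject to $x - 1 = 0$, the dual function is $q(\gamma) = \inf_x x^2 + \gamma(x-1) = -\gamma^2/4 - \gamma$; we maximize it crudely and recover the primal solution via (2.5):

```python
# Dual of  min x^2  s.t.  g(x) = x - 1 = 0:  q(gamma) = -gamma^2/4 - gamma,
# maximized at gamma* = -2. Primal recovery via (2.5): x* = argmin_x L(x, gamma*).

gammas = [g / 100.0 for g in range(-400, 401)]   # crude grid over gamma
q = lambda g: -g * g / 4.0 - g                   # dual function, derived by hand
g_star = max(gammas, key=q)                      # approximate dual solution
x_star = -g_star / 2.0                           # primal recovery via (2.5)
print(g_star, x_star)                            # -2.0, 1.0
```

Since the objective is strictly convex and the equality constraint is affine, Theorem 3 together with Remark 1 guarantees a zero duality gap, so the minimizer of $\mathcal{L}(x, \gamma^\star)$ is indeed the primal solution.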

## **2.1.2 Sequential quadratic programming**

There are four main branches of methods for solving non-linear and possibly non-convex optimization problems in the form of (2.1):

- gradient-based methods,
- Sequential Quadratic Programming (SQP) methods,
- Interior Point (IP) methods,
- augmented Lagrangian methods,

and combinations thereof. Gradient-based methods are typically used for large-scale problems, where second-order information is too expensive to evaluate. SQP and IP methods are very successful and exhibit fast convergence to local minimizers in many cases; however, each iteration is more costly compared with gradient-based methods. Augmented Lagrangian methods are often used for large-scale problems and as a building block for other algorithms (also within SQP). Here, we review the fundamentals of SQP and augmented Lagrangian methods, because both are essential for understanding the Augmented Lagrangian Alternating Direction Inexact Newton (ALADIN) algorithm. Gradient methods and interior point methods are not considered here; instead we refer to the textbooks [NW06; Ber99; Wri97; Nes13] for details.

SQP algorithms can be derived in two different ways. One is based on the idea of solving (2.1) by sequentially approximating it by quadratic programs (QPs), which is where the name comes from. Here, we recall the second way, which views SQP as solving the optimality conditions (2.2) by a Newton-type method. In absence of inequality constraints, the KKT conditions (2.2) for problem (2.1) read

$$F(\boldsymbol{x}, \boldsymbol{\gamma}) := \begin{pmatrix} \nabla f(\boldsymbol{x}) + \nabla g(\boldsymbol{x})^{\top}\boldsymbol{\gamma} \\ \boldsymbol{g}(\boldsymbol{x}) \end{pmatrix} \stackrel{!}{=} 0. \tag{2.6}$$

To solve this system (and thus find KKT points), let us apply a Newton-type method defined via the iteration

$$q^{k+1} = q^k - (M^k)^{-1} F(q^k),\tag{2.7}$$

where $q := (x^\top, \gamma^\top)^\top$ and $M^k$ is a non-singular approximation of $\nabla F(q^k)$. The Newton-type iteration (2.7) applied to (2.6) yields

$$
\begin{pmatrix} B^k & \nabla g(\boldsymbol{x}^k)^\top \\ \nabla g(\boldsymbol{x}^k) & 0 \end{pmatrix} \begin{pmatrix} \boldsymbol{x}^{k+1} - \boldsymbol{x}^k \\ \boldsymbol{\gamma}^{k+1} - \boldsymbol{\gamma}^k \end{pmatrix} = -\begin{pmatrix} \nabla f(\boldsymbol{x}^k) + \nabla g(\boldsymbol{x}^k)^{\top}\boldsymbol{\gamma}^{k} \\ \boldsymbol{g}(\boldsymbol{x}^k) \end{pmatrix},\tag{2.8}
$$

where $B^k$ is an approximation of the Lagrangian Hessian $\nabla^2_{x} \mathcal{L}(x^k, \gamma^k)$.<sup>4</sup> In the next paragraph we will recall that if an SQP algorithm is initialized close enough to a local minimizer $(x^\star, \gamma^\star)$, the pair $(x^k, \gamma^k)$ converges to $(x^\star, \gamma^\star)$ at a locally quadratic rate, provided that $\nabla^2_{x} \mathcal{L}(x^\star, \gamma^\star)$ is positive definite and LICQ holds at $(x^\star, \gamma^\star)$. SQP can be extended to problems with inequalities; we refer to [NW06, Chap. 18] for details.

#### **Local convergence of Newton-type methods**

Next, we give a proof of local convergence and convergence rates for the above Newton iteration (2.8). We consider *approximations* $M^k$ of the KKT matrix instead of exact ones. These approximations are important, for example, if one

<sup>4</sup> Positive definiteness of $B^k$ is essential for local convergence of Newton-type methods. These methods differ in the way they compute $B^k$, cf. [NW06, Ch 3.4].

would like to use Hessian approximations $B^k \approx \nabla^2_{x} \mathcal{L}(x^k, \gamma^k)$ instead of exact Hessians when using Newton-type methods in the context of SQP. When doing so, the local convergence rate also depends on the "quality" of these approximations. Local convergence of the iteration (2.8) is guaranteed by the following theorem.

*Theorem 4 (Local convergence of Newton-type methods [Die16, Thm 8.1])* Suppose that $F(q^\star) = 0$ holds. Moreover, suppose that the iteration is initialized close enough to $q^\star$, i.e. with $\|q^0 - q^\star\| < \frac{2(1-\kappa)}{\omega}$, where $\kappa$ and $\omega$ are defined below. Then, the iterates generated by (2.8) converge to $q^\star$ and satisfy the convergence-rate estimate

$$\|q^{k+1} - q^{\star}\| \le \left(\kappa + \frac{\omega}{2} \|q^k - q^{\star}\|\right) \|q^k - q^{\star}\|,\tag{2.9}$$

if there exist an $\omega < \infty$ and a $\kappa < 1$ such that for all $k$ and all $q$

$$\left\| \left( M^k \right)^{-1} (M^k - \nabla F(q^k)) \right\| \le \kappa^k \le \kappa,\tag{2.10a}$$

$$\left\|(M^k)^{-1}\left(\nabla F(q) - \nabla F(q^k)\right)\right\| \le \frac{\omega}{2} \|q^k - q\|. \tag{2.10b}$$

Here the former inequality (2.10a) is called *compatibility condition* and the latter inequality (2.10b) is a *Lipschitz condition*. The proof of Theorem 4 is given in Appendix A.4 for the sake of completeness and because of its importance for the convergence analysis of ALADIN. Next, we give examples of local convergence rates to build some intuition on the effect of the two conditions (2.10a) and (2.10b) in Theorem 4.

*Example 1 (Convergence rates of Newton-type methods)* Assume that $\nabla F$ is Lipschitz continuous. Then, if we choose a non-singular $M^k$, the Lipschitz condition (2.10b) is always satisfied.

For checking the compatibility condition (2.10a), consider three variants of Jacobian approximations $M^k$: exact Jacobians with $M^k = \nabla F(q^k)$; damped Newton with $M^k = \nabla F(q^k) + \epsilon I$ for some $\epsilon > 0$; and damped Newton with vanishing damping, $M^k = \nabla F(q^k) + 10^{-k}\eta I$ for some $\eta > 0$. For the first case, it immediately follows from (2.10a) that $\kappa = 0$. Thus, we have local quadratic convergence by (2.9) (cf. Section A.1 for a brief overview of convergence rate estimates).

Figure 2.1: Pure Newton (blue), damped Newton (red) and damped Newton with vanishing damping (yellow) for solving $\sin(x) = 0$.

For damped Newton, condition (2.10a) reads $\|(\nabla F(q^k) + \epsilon I)^{-1}(\nabla F(q^k) + \epsilon I - \nabla F(q^k))\| = \|(\nabla F(q^k) + \epsilon I)^{-1}\epsilon I\| = \kappa^k$, where one can make $\kappa^k$ arbitrarily small by choosing a small $\epsilon$ if $\nabla F(q^k)$ is non-singular. This yields linear convergence with an arbitrarily fast modulus $\kappa$.

For damped Newton with a vanishing damping term we have $\kappa^k := \|(\nabla F(q^k) + 10^{-k}\eta I)^{-1}\|\, 10^{-k}\eta$ converging to zero for $k \to \infty$. Thus, if $\nabla F(q^k)$ is non-singular, one obtains superlinear convergence.

*Example 2 (Linear convergence is not necessarily slow)* Let us consider the problem of solving $F(x) = \sin(x) = 0$. Figure 2.1 shows the convergence of pure Newton's method (blue), damped Newton's method (red) with $\epsilon \in \{0.8, 2\}$, and damped Newton with vanishing damping (yellow) with $\eta \in \{0.3, 0.1\}$. One can clearly see that by choosing appropriate $\epsilon$ and $\eta$, one can make convergence of the damped Newton variants almost as fast as that of pure Newton's method. Thus, linear convergence does not necessarily mean that convergence is slow; it largely depends on the convergence modulus $\kappa$. This is especially relevant since damped Newton methods, for example, typically converge at a quite fast linear rate if the damping parameter is sufficiently small. First-order methods such as ADMM, which we will introduce later, are also guaranteed to converge linearly for certain problem classes, yet can be quite slow in practice.
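The behavior discussed in Examples 1 and 2 can be reproduced numerically. The following sketch (with illustrative parameter values, not necessarily those used for Figure 2.1) runs the three Newton-type variants on $F(x) = \sin(x) = 0$:

```python
import math

# Newton-type iteration x^{k+1} = x^k - F(x^k)/M(x^k, k) for F(x) = sin(x),
# with three choices of the Jacobian approximation M^k from Example 1.

def newton_type(M, x0=0.5, iters=40):
    """Run the scalar Newton-type iteration with approximation M(x, k)."""
    x = x0
    for k in range(iters):
        x = x - math.sin(x) / M(x, k)
    return x

x_pure   = newton_type(lambda x, k: math.cos(x))                # kappa = 0: quadratic
x_damped = newton_type(lambda x, k: math.cos(x) + 0.8)          # linear rate
x_vanish = newton_type(lambda x, k: math.cos(x) + 0.3 * 10**-k) # superlinear rate

print(x_pure, x_damped, x_vanish)   # all three approach the root x* = 0
```

All three variants converge to the root; the damped variant merely takes more iterations, in line with the discussion of the modulus $\kappa$ above.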

*Remark 2 (Singular* $\nabla F(q^\star)$*)* Note that if $\nabla F(q^\star)$ is singular, the convergence rate is in general only linear, even with exact Jacobians $M^k = \nabla F(q^k)$. In that case, the Lipschitz condition (2.10b) is violated, since $\|(\nabla F(q^k))^{-1}\| \to \infty$ for $q^k \to q^\star$. Thus, the second-last step of the proof of Theorem 4 (Appendix A.4) cannot be performed. In this case we only know that, by continuity of $\nabla F$, $\int_0^1 (M^k)^{-1} \left( \nabla F\left(q^\star + t(q^k - q^\star)\right) - \nabla F(q^k) \right) \mathrm{d}t$ is bounded, yielding linear convergence. As an example, consider the problem $F(x) = x^2 = 0$ with $x^\star = 0$. Then the Newton iteration reads $x^{k+1} = x^k - (x^k)^2/(2x^k) = x^k/2$, and thus by taking norms $\|x^{k+1} - 0\| \le \frac{1}{2}\|x^k - 0\|$. Notice that in this case, regularization procedures have to be considered, as $\nabla F$ becomes increasingly ill-conditioned close to $x^\star$ and thus factorization of $M^k$ becomes problematic.

*Remark 3 (Local vs. global convergence)* Note that one distinguishes between local and global convergence of optimization algorithms. Local convergence characterizes the convergence behavior of algorithms when initialized close to a local minimizer. In order to reach this region of local convergence, typically globalization routines are required. These routines aim at driving the iterates into the region of local convergence and should then typically "deactivate". One might guess that a globally convergent algorithm converges to a local minimizer from any initial point; unfortunately, this is not the case. Typically, it is only guaranteed to converge to a stationary point of a so-called merit function, which can for example be the augmented Lagrangian. Unfortunately, not all stationary points of the augmented Lagrangian are local minimizers [Ber99, Chap 4.3]. Especially in case of non-convex constraints, linearly dependent constraint linearizations can occur, contradicting the linear independence assumption sometimes made in the global convergence analysis of nonlinear programming algorithms [BT95, Sec 4]. We give a concrete example of the difficulties that can occur in the context of non-convex constraints in Section 2.3.3.

## **2.1.3 Multiplier methods**

Instead of working on the optimality conditions (2.2) directly and updating primal and dual variables simultaneously, there is a different class of algorithms called *multiplier methods* [Ber82; Ber99]. There, the idea is to update primal and dual variables separately. This has advantages in several contexts, especially for developing distributed algorithms, since under certain conditions the primal updates can be done in a distributed fashion. There are two basic approaches in this class: one working on the Lagrangian function and a second one working on the so-called *augmented Lagrangian* function, which adds a regularization term to the objective function without changing the minimizer. The augmented Lagrangian for (2.1) (in case of equality constraints only) is defined as

$$\mathcal{L}^{\rho}(\boldsymbol{x}, \boldsymbol{\gamma}) := f(\boldsymbol{x}) + \boldsymbol{\gamma}^{\top} \boldsymbol{g}(\boldsymbol{x}) + \frac{\rho}{2} \left\| \boldsymbol{g}(\boldsymbol{x}) \right\|_{2}^{2}.$$

Note that for $\rho = 0$, one recovers the standard Lagrangian. The method of multipliers proceeds in two simple steps: first, we fix a Lagrange multiplier estimate $\gamma^k$; then we minimize the Lagrangian/augmented Lagrangian with respect to $x$ in a separate step. The resulting algorithm, summarized in Algorithm 1, is guaranteed to converge to a minimizer of (2.1) if the multiplier estimates are close enough to $\gamma^\star$. If we were able to estimate the multipliers exactly, multiplier methods would converge in one step if $\rho$ is large enough, cf. [Ber99, Sec 3.2.1]. Moreover, multiplier methods converge to a *global* minimizer of (2.1) for $\rho \to \infty$ [Ber99, Prop 4.2.1]. However, although this result is very appealing from a theoretical point of view, in practice driving $\rho$ to $\infty$ is problematic due to the corresponding ill-conditioning of $\nabla^2_x \mathcal{L}^{\rho}(x, \gamma)$ and the necessity to minimize $\mathcal{L}^{\rho}$ globally, which is hard to achieve with practical solvers.

## Algorithm 1. General multiplier method

Initialization: $\gamma^0$, $\rho$
Repeat:
1) $x^{k+1} = \operatorname{argmin}_{x} \mathcal{L}^{\rho}(x, \gamma^k)$
2) update the multiplier estimate $\gamma^{k+1}$

If we choose the multiplier update as

$$
\gamma^{k+1} = \gamma^k + c^k g(x^{k+1}), \tag{2.11}
$$

the method is called *method of multipliers*.
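As an illustration of this scheme with the update (2.11), the following sketch (toy problem, penalty value and stepsize $c^k = \rho$ are assumed choices for illustration) runs the method of multipliers on a scalar equality-constrained problem whose inner minimization can be done in closed form:

```python
# Method of multipliers for  min x^2  s.t.  g(x) = x - 1 = 0,
# with solution x* = 1 and multiplier gamma* = -2.
# Inner step: argmin_x  x^2 + gamma*(x-1) + rho/2*(x-1)^2 = (rho - gamma)/(2 + rho).

rho = 10.0        # penalty parameter (assumed value)
gamma = 0.0       # initial multiplier estimate
for k in range(50):
    x = (rho - gamma) / (2.0 + rho)    # step 1: minimize the augmented Lagrangian
    gamma = gamma + rho * (x - 1.0)    # step 2: multiplier update (2.11), c^k = rho

print(x, gamma)   # approaches x* = 1, gamma* = -2
```

Note how the dual update only needs the constraint residual $g(x^{k+1})$; this separation of primal and dual updates is exactly what distributed algorithms exploit below.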

## **2.2 Distributed optimization algorithms**

Now let us consider optimization problems with special structure leading to *distributed optimization algorithms*. Note that in the literature various forms of such structures are employed [ND11]. In many cases, they can be converted into each other. We give a brief overview on these structures in Appendix A.2. Here, we consider problems in so-called *affinely-coupled separable form*

$$\min_{\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{R}} \sum_{i \in \mathcal{R}} f_{i}(\boldsymbol{x}_{i}) \tag{2.12a}$$

$$\text{subject to} \quad \boldsymbol{g}_i(\boldsymbol{x}_i) = \boldsymbol{0}, \qquad \forall i \in \mathcal{R}, \tag{2.12b}$$

$$h\_i(\mathbf{x}\_i) \le 0,\qquad\qquad\forall i \in \mathcal{R},\qquad(2.12c)$$

$$\sum\_{i \in \mathcal{R}} A\_i x\_i = b,\tag{2.12d}$$

where $\mathcal{R} = \{1, \ldots, R\}$ is the set of *agents* or *subsystems*, to each of which we assign an objective function $f_i : \mathbb{R}^{n_i} \to \mathbb{R}$, equality constraints $g_i : \mathbb{R}^{n_i} \to \mathbb{R}^{n_{gi}}$ and inequality constraints $h_i : \mathbb{R}^{n_i} \to \mathbb{R}^{n_{hi}}$. We assume that all $f_i$, $g_i$ and $h_i$ are sufficiently often continuously differentiable. The agents are coupled through (2.12d), which is called the *consensus* or *coupling* constraint. Note that many practical problems in the form of (2.1) can be reformulated in the form of (2.12) by introducing auxiliary variables, cf. Appendix A.2. Next, we recall one of the earliest distributed optimization methods, called *dual decomposition* [Eve63].

## **2.2.1 Dual decomposition**

To simplify notation, we use the indicator function $\iota_{\mathcal{X}} : \mathbb{R}^{n} \to \mathbb{R} \cup \{\infty\}$,

$$\iota_{\mathcal{X}}(\boldsymbol{x}) := \begin{cases} 0 & \text{if } \boldsymbol{x} \in \mathcal{X}, \\ \infty & \text{else}, \end{cases}$$

in the subsequent analysis. With the indicator function we can reformulate (2.12) as

$$\min\_{\boldsymbol{x}\_1,\ldots,\boldsymbol{x}\_R} \sum\_{\boldsymbol{i}\in\mathcal{R}} \tilde{f}\_{\boldsymbol{i}}(\boldsymbol{x}\_{\boldsymbol{i}}) \tag{2.13a}$$

$$\text{subject to} \quad \sum\_{i \in \mathcal{R}} A\_i \mathbf{x}\_i = b,\tag{2.13b}$$

with $\tilde{f}_i := f_i + \iota_{\mathcal{X}_i}$ and $\mathcal{X}_i := \{x_i \in \mathbb{R}^{n_i} \mid g_i(x_i) = 0,\ h_i(x_i) \le 0\}$. As the name indicates, dual decomposition is closely related to solving the *dual problem* to (2.13). Let us assume that the assumptions of Theorem 3 (convexity, Slater's condition) hold for problem (2.13). Then we can replace (2.13) by its dual problem

$$\sup\_{\lambda} q(\lambda) = \sup\_{\lambda} \min\_{\mathbf{x} \in \mathcal{X}} \sum\_{i \in \mathcal{R}} \tilde{f}\_i(\mathbf{x}\_i) + \lambda^{\top} \left(\sum\_{i \in \mathcal{R}} A\_i \mathbf{x}\_i - b\right). \tag{2.14}$$

One very simple way to solve (2.14) is to apply a gradient method. If $f$ is strictly convex and $\mathcal{X}$ is compact, one can show that $q$ is continuously differentiable and its gradient is given by

$$\nabla q(\boldsymbol{\lambda}) = \sum_{i \in \mathcal{R}} A_i \boldsymbol{x}_i(\boldsymbol{\lambda}) - b,$$

where $x_i(\lambda)$ is computed by evaluating the dual function [Ber99, Prop 6.1.1]. The gradient iteration for maximizing (2.14) then is

$$
\lambda^{k+1} = \lambda^k + \alpha \,\nabla q(\lambda^k), \tag{2.15}
$$

where $\alpha \in (0, 1)$ is a stepsize parameter. Observe that $q$ is separable, i.e. we can write $q$ as $q(\lambda) = \sum_{i \in \mathcal{R}} q_i(\lambda) - \lambda^\top b$, where $q_i(\lambda) = \min_{x_i \in \mathcal{X}_i} f_i(x_i) + \lambda^\top A_i x_i$, so for fixed $\lambda$, $q$ can be evaluated *in parallel*. The resulting gradient method is called *dual decomposition* and is summarized in Algorithm 2, where $\mathcal{L}_i(x_i, \lambda) := f_i(x_i) + \lambda^\top A_i x_i$. Note that dual decomposition is very effective when the minimization in the evaluation of the dual can be replaced by an explicit expression (e.g. when solving strictly convex QPs).

Algorithm 2. Dual decomposition
Initialization: $\lambda^0$, $\alpha$
Repeat:
1) $x_i^{k+1} = \operatorname{argmin}_{x_i \in \mathcal{X}_i} \mathcal{L}_i(x_i, \lambda^k)$ for all $i \in \mathcal{R}$ (parallel)
2) $\lambda^{k+1} = \lambda^k + \alpha \left( \sum_{i \in \mathcal{R}} A_i x_i^{k+1} - b \right)$

An advantage of dual decomposition is that it is very simple to implement. Moreover, if the $A_i$ are sparse (i.e. there are many $A_i$ with zero rows), it is possible to evaluate step 2) of Algorithm 2 in a decentralized fashion, as computing the components of $\lambda$ involves only the summands of the subsystems with nonzero rows in $A_i$. On the other hand, dual decomposition requires strict convexity of $f$ and compactness of $\mathcal{X}$, or strong convexity of $f$, as otherwise the minimizer in step 1) might not exist.<sup>5</sup> In turn this means that for general convex problems, dual decomposition might fail. Moreover, as dual decomposition is a gradient method with a fixed stepsize, convergence is in general only sublinear [Nes18, Thm 2.1.14].
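The following minimal sketch (a toy problem with assumed data and stepsize, not an example from this thesis) runs dual decomposition on a two-agent problem where step 1) has a closed-form solution, as is typical for strictly convex QPs:

```python
# Dual decomposition for  min x1^2 + x2^2  s.t.  x1 + x2 = 1
# (A1 = A2 = 1, b = 1). Local step:  x_i(lambda) = argmin x_i^2 + lambda*x_i
# = -lambda/2, so both local problems are solved in closed form.

lam = 0.0       # dual variable of the coupling constraint
alpha = 0.5     # stepsize (assumed value; must be small enough)
for k in range(100):
    x = [-lam / 2.0, -lam / 2.0]           # step 1: local minimizations (parallel)
    lam = lam + alpha * (sum(x) - 1.0)     # step 2: dual gradient step (2.15)

print(x, lam)   # approaches the primal solution (1/2, 1/2) and lambda* = -1
```

Each agent only needs the current $\lambda$ and reports back $A_i x_i$, which is the limited information exchange that motivates the method.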

## **2.2.2 Alternating Direction Method of Multipliers**

To overcome these weaknesses (in particular the restriction to strictly/strongly convex problems), dual decomposition can be combined with the method of multipliers from Section 2.1.3, yielding the *Alternating Direction Method of Multipliers (ADMM)*. Herein, the idea is to use the *augmented Lagrangian* (instead of the Lagrangian), whose additional convexification term renders the algorithm applicable to convex but not necessarily strictly convex problems. Unfortunately, the augmentation term in the first step of the method of multipliers jeopardizes the separability of $\mathcal{L}^{\rho}$, hindering parallel evaluation of the dual function.<sup>6</sup> To overcome this issue, one usually constructs an augmented problem to (2.13) introducing variable copies of $x$ and then executes

<sup>5</sup> This can be the case even for very simple problems. For example, consider a problem with affine cost. Then the problem in step 1) might be unbounded from below and dual decomposition would fail.

<sup>6</sup> As an alternative, one can also see ADMM as a dual ascent method, where one solves a modified but equivalent version of (2.13) with a penalization term $\frac{\rho}{2}\|\sum_{i \in \mathcal{R}} A_i x_i - b\|_2^2$ in the objective [Boy+11].

the first step of the method of multipliers (Section 2.1.3) in an "alternating" or coordinate-descent fashion, recovering partial separability. We will now briefly recall ADMM for the augmented problem and then show how our problem (2.13) fits into this setting.

The augmented problem in so-called two-block form reads

$$\min_{\boldsymbol{x}, \boldsymbol{z}} f(\boldsymbol{x}) + g(\boldsymbol{z}) \tag{2.16}$$

$$\text{subject to} \quad C\boldsymbol{x} + D\boldsymbol{z} = c.\tag{2.17}$$

For applying the method of multipliers, we consider the augmented Lagrangian

$$\mathcal{L}^{\rho}(\boldsymbol{x}, \boldsymbol{z}, \boldsymbol{\lambda}) = f(\boldsymbol{x}) + g(\boldsymbol{z}) + \boldsymbol{\lambda}^{\top}(C\boldsymbol{x} + D\boldsymbol{z} - c) + \frac{\rho}{2} \left\| C\boldsymbol{x} + D\boldsymbol{z} - c \right\|_{2}^{2} . \tag{2.18}$$

We apply the method of multipliers (Algorithm 1) to problem (2.16) with one key difference: instead of performing a joint minimization with respect to $x$ and $z$ in step 1), we perform the minimization with respect to $x$ and with respect to $z$ in two separate steps. This yields ADMM, shown in Algorithm 3.

Algorithm 3. Alternating Direction Method of Multipliers (ADMM)
Initialization: $z^0$, $\lambda^0$, $\rho$
Repeat:
1) $x^{k+1} = \operatorname{argmin}_{x} \mathcal{L}^{\rho}(x, z^k, \lambda^k)$
2) $z^{k+1} = \operatorname{argmin}_{z} \mathcal{L}^{\rho}(x^{k+1}, z, \lambda^k)$
3) $\lambda^{k+1} = \lambda^k + \rho \left( C x^{k+1} + D z^{k+1} - c \right)$

Now consider problem (2.13). Let us write (2.13) in two-block form by introducing variable copies $z_i$ of the $x_i$ for all $i \in \mathcal{R}$, yielding

$$\min_{\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{R},\boldsymbol{z}_{1},\ldots,\boldsymbol{z}_{R}} \sum_{i\in\mathcal{R}} \tilde{f}_{i}(\boldsymbol{x}_{i}) + \iota_{\mathcal{C}}(\boldsymbol{z})\tag{2.19a}$$

$$\text{subject to} \quad A\_i(x\_i - z\_i) = 0 \qquad \forall i \in \mathcal{R}, \tag{2.19b}$$

where $\mathcal{C} := \{z \in \mathbb{R}^{n} \mid \sum_{i \in \mathcal{R}} A_i z_i = b\}$. Then, the augmented Lagrangian (2.18) reads

$$\mathcal{L}^{\rho}(\boldsymbol{x}, \boldsymbol{z}, \boldsymbol{\lambda}) = \iota_{\mathcal{C}}(\boldsymbol{z}) + \sum_{i \in \mathcal{R}} \left( \tilde{f}_i(\boldsymbol{x}_i) + \boldsymbol{\lambda}_i^\top A_i(\boldsymbol{x}_i - \boldsymbol{z}_i) + \frac{\rho}{2} \|A_i(\boldsymbol{x}_i - \boldsymbol{z}_i)\|_2^2 \right),\tag{2.20}$$

where $\lambda^\top = (\lambda_1^\top, \ldots, \lambda_R^\top)$. Now it becomes clear why introducing additional auxiliary variables and alternating minimization helps: if one fixes $z$ in (2.20) and minimizes with respect to $x$, the minimizations with respect to the $x_i$ are independent of each other and can be executed in parallel. This yields a parallel version of ADMM for problem (2.13), shown in Algorithm 4.

Algorithm 4. ADMM (parallel version for problem (2.13))
Initialization: $z^0$, $\lambda^0$, $\rho$
Repeat:
1) $x_i^{k+1} = \operatorname{argmin}_{x_i} \tilde{f}_i(x_i) + \lambda_i^{k\top} A_i x_i + \frac{\rho}{2}\|A_i(x_i - z_i^k)\|_2^2$, $i \in \mathcal{R}$ (parallel)
2) $z^{k+1} = \operatorname{argmin}_{z \in \mathcal{C}} \sum_{i \in \mathcal{R}} -\lambda_i^{k\top} A_i z_i + \frac{\rho}{2}\|A_i(x_i^{k+1} - z_i)\|_2^2$
3) $\lambda_i^{k+1} = \lambda_i^k + \rho A_i(x_i^{k+1} - z_i^{k+1})$, $i \in \mathcal{R}$ (parallel)

Note that the "coordination step" minimizing $\mathcal{L}^{\rho}$ with respect to $z$ is a convex QP which requires central computation. However, under certain conditions this step can be simplified to a simple averaging step between subsystems [Boy+11]. This renders ADMM a *decentralized* algorithm which can be executed purely based on neighbor-to-neighbor communication. The multiplier update step is also executable in parallel. Note that there are further specialized variants of ADMM, mostly differing in the underlying problem structures. Many of them can be reformulated in two-block form; we provide several examples in Appendix A.2.

Due to the augmentation term, if either $\mathcal{X}$ is bounded or $A$ has full row rank, ADMM is guaranteed to converge also for non-strictly convex problems [BT89, Prop 4.2] [Boy+11, App A]. The convergence rate, however, depends on strict convexity of $\tilde{f}$ and does *not* directly carry over from the method of multipliers.<sup>7</sup> If $\tilde{f}$ is convex, convergence of ADMM is generally sublinear [HY15]. Under certain conditions, such as strong convexity [DY16; GB16; Shi+14] or a combination of other conditions including strict convexity [HL17], linear convergence can be achieved.<sup>8</sup>

*Remark 4 (ADMM variants)* There are various variants and derivations of ADMM, differing in the order of the updates as well as in the problem setup. Moreover, multi-block variants of ADMM exist, where more than two blocks of variables are considered [HL17; LMZ15]. The paper [Boy+11] provides an excellent overview of the most common forms of ADMM.

To illustrate that distributed optimization methods can be very useful in computer science, and especially in machine learning, we present an application example: ADMM applied to support-vector machines for classification.

*Example 3 (ADMM for machine learning: support-vector machines)* A standard method for classification is the Support-Vector Machine (SVM), which aims at separating sets of points by a hyperplane. For SVMs, distributed optimization is interesting primarily for very large data sets. One reason is computational speedup, but data privacy is another reason why one would like to compute on multiple machines in case of sensitive data [FCG10]. A linear SVM (for two classes) can be cast as the optimization problem

$$\min\_{\mathbf{w}, b} \sum\_{j \in \mathcal{Y}} \left( 1 - \sigma\_j(\mathbf{w}^\top \mathbf{y}\_j - b) \right)\_+ + \frac{\delta}{2} \|\mathbf{w}\|\_2^2,\tag{2.21}$$

where $(\cdot)_+ := \max(0, \cdot)$ and $y_j \in \mathbb{R}^2$, $j \in \mathcal{Y} = \{1, \ldots, N\}$, are given data points [CL11]. These points are classified into two groups labeled with $\sigma_j \in \{-1, 1\}$. The optimization variables $w \in \mathbb{R}^2$ and $b \in \mathbb{R}$ define the affine subspace separating the two groups of points.

<sup>7</sup> One explanation for this is the fact that the convergence rate estimates of the method of multipliers depend on the *exact* minimization of $\mathcal{L}^{\rho}$ jointly with respect to $x$ *and* $z$. In ADMM, however, one executes a form of block-coordinate descent, yielding an inexact minimization.

<sup>8</sup> Moreover, ADMM is convergent for special classes of non-convex problems, cf. [WYZ19; HLR16]. However, for general non-convex problems there is, to the best of the author's knowledge, no convergence guarantee.

We would like to distribute the computation among $R$ processors, $\mathcal{R} = \{1, \ldots, R\}$. To this end, we partition the data points into groups $\mathcal{Y}_i$ with $\bigcup_{i=1,\ldots,R} \mathcal{Y}_i = \mathcal{Y}$. Hence, problem (2.21) can be written as

$$\begin{aligned} \min_{\{\boldsymbol{w}_{i}\}_{i\in\mathcal{R}},\{b_{i}\}_{i\in\mathcal{R}},\bar{\boldsymbol{w}},\bar{b}} \quad & \sum_{i\in\mathcal{R}} \sum_{j\in\mathcal{Y}_{i}} \left(1-\sigma_{j}(\boldsymbol{w}_{i}^{\top}\boldsymbol{y}_{j}-b_{i})\right)_{+}+\frac{\delta}{2} \|\boldsymbol{w}_{i}\|_{2}^{2},\\ \text{subject to} \quad & \boldsymbol{w}_{i}=\bar{\boldsymbol{w}}, \quad b_{i}=\bar{b}, \qquad i=1,\ldots,R, \end{aligned}$$

which is in form of (2.12).<sup>9</sup> The augmented Lagrangian then reads

$$\begin{split} \mathcal{L}^{\rho}(\{\boldsymbol{w}_{i}\}_{i \in \mathcal{R}}, \{b_{i}\}_{i \in \mathcal{R}}, \bar{\boldsymbol{w}}, \bar{b}, \boldsymbol{\lambda}) &= \sum_{i \in \mathcal{R}} \sum_{j \in \mathcal{Y}_{i}} \left(1 - \sigma_{j}(\boldsymbol{w}_{i}^{\top}\boldsymbol{y}_{j} - b_{i})\right)_{+} + \frac{\delta}{2} \|\boldsymbol{w}_{i}\|_{2}^{2} \\ &+ \lambda_{\boldsymbol{w}i}^{\top}(\boldsymbol{w}_{i} - \bar{\boldsymbol{w}}) + \lambda_{bi}(b_{i} - \bar{b}) + \frac{\rho}{2} \|\boldsymbol{w}_{i} - \bar{\boldsymbol{w}}\|_{2}^{2} + \frac{\rho}{2} \|b_{i} - \bar{b}\|_{2}^{2}. \end{split}$$

ADMM leads to the iterates

$$\begin{split} (\boldsymbol{w}_{i}^{k+1\top}, b_{i}^{k+1}) &= \operatorname*{argmin}_{\boldsymbol{w}_{i}, b_{i}} \sum_{j\in\mathcal{Y}_{i}} \left( 1-\sigma_{j}(\boldsymbol{w}_{i}^{\top}\boldsymbol{y}_{j}-b_{i}) \right)_{+} + \frac{\delta}{2} \|\boldsymbol{w}_{i}\|_{2}^{2} + \lambda_{\boldsymbol{w}i}^{k\top}(\boldsymbol{w}_{i}-\bar{\boldsymbol{w}}^{k}) \\ &\quad + \lambda_{bi}^{k}(b_{i}-\bar{b}^{k}) + \frac{\rho}{2} \left( \|\boldsymbol{w}_{i}-\bar{\boldsymbol{w}}^{k}\|_{2}^{2} + \|b_{i}-\bar{b}^{k}\|_{2}^{2} \right), \quad i\in\mathcal{R}, \\ (\bar{\boldsymbol{w}}^{k+1\top},\bar{b}^{k+1}) &= \frac{1}{|\mathcal{R}|} \sum_{i\in\mathcal{R}} (\boldsymbol{w}_{i}^{k+1\top}, b_{i}^{k+1}), \\ (\lambda_{\boldsymbol{w}i}^{k+1\top},\lambda_{bi}^{k+1}) &= (\lambda_{\boldsymbol{w}i}^{k\top},\lambda_{bi}^{k}) + \rho\left( (\boldsymbol{w}_{i}^{k+1\top}, b_{i}^{k+1}) - (\bar{\boldsymbol{w}}^{k+1\top},\bar{b}^{k+1}) \right), \qquad i\in\mathcal{R}. \end{split}$$

Note that the first (and by far most expensive) step of the above iterates can be implemented in parallel.<sup>10</sup> This first step essentially solves an independent SVM problem considering only the data from $\mathcal{Y}_i$, with two additional summands in the objective function representing information from the other data sets through the global averages $\bar{w}$ and $\bar{b}$.

<sup>9</sup> Except for being non-smooth. However, ADMM can in general also be applied to non-smooth problems such as the one considered here [Boy+11].

<sup>10</sup> Note that the summands involving the multipliers $\lambda_{wi}$ and $\lambda_{bi}$ in the second step cancel out, as one can show that their sum vanishes after the first iteration [Boy+11, Chap 7]. Moreover, observe that the parallel step can be implemented by a QP solver.

Figure 2.2: Convergence of the ADMM-based SVM.

Figure 2.3: Centralized and distributed SVM result.

This implementation is especially useful for large datasets: in this case, the data can be distributed equally among all processors by choosing the same cardinality for each $\mathcal{Y}_i$. This way, each subsystem needs approximately the same time to solve its local SVM. By doing so, one can (theoretically) keep the execution time constant when increasing the number of processors simultaneously with the number of data points. However, keep in mind that such a decomposition makes sense only for large problems solved on multiple processors, or in case privacy is of concern. For small datasets, standard centralized convex QP algorithms (e.g. interior point methods) will usually outperform ADMM.

To test the ADMM-based SVM, we consider a set of 2,000 data points distributed equally among $R = 10$ processors. We use $\rho = \delta = 1$ as ADMM and SVM parameters, cf. [Boy+11, Chap 3.3]. The datasets are partitioned in the worst possible way to exclude good solutions by chance: we choose 5 subsystems with all positive datapoints and 5 subsystems with all negative datapoints. Figure 2.2 shows the convergence of the ADMM-based SVM, where $s^k = \rho\left( (\bar{w}^{k+1\top}, \bar{b}^{k+1}) - (\bar{w}^{k\top}, \bar{b}^{k}) \right)$ is the *dual residual*, i.e. the degree of stationarity in the KKT conditions (2.2a). One can see that ADMM converges quite slowly and only to a limited accuracy. However, Figure 2.3 depicts the resulting hyperplane computed by the centralized and the distributed SVM; the two hyperplanes are almost indistinguishable. This is typical for ADMM: it converges to medium accuracy in a handful of iterations and then slows down. Since this level of accuracy is sufficient for many applications, especially in the context of machine learning, ADMM seems to be an excellent choice for such applications.

## **2.2.3 Basic ALADIN**

A fundamental limitation of dual decomposition and ADMM is that their convergence guarantees hold only for convex or strictly convex problems. One of the very few algorithms for *non-convex* problems is the Augmented Lagrangian Alternating Direction Inexact Newton (ALADIN) algorithm [HFD16]. Main advantages of ALADIN are its local quadratic convergence rate and its local convergence guarantees also for non-convex problems.

In contrast to ADMM, ALADIN does *not* introduce auxiliary variables into the problem formulation. It directly deals with the partial augmented Lagrangian of (2.12) with respect to (2.12d)

$$\mathcal{L}^{\rho}(\boldsymbol{x}, \boldsymbol{\lambda}) = \sum_{i \in \mathcal{R}} f_i(\boldsymbol{x}_i) + \iota_{\mathcal{X}_i}(\boldsymbol{x}_i) + \boldsymbol{\lambda}^{\top} \left( \sum_{i \in \mathcal{R}} A_i \boldsymbol{x}_i - \boldsymbol{b} \right) + \frac{\rho}{2} \left\| \sum_{i \in \mathcal{R}} A_i \boldsymbol{x}_i - \boldsymbol{b} \right\|_{2}^{2} . \tag{2.22}$$

Consider the method of multipliers, but instead of applying a full minimization of $\mathcal{L}^{\rho}$ with respect to $x$, we apply only one equality-constrained SQP step, yielding

$$\begin{split} \min_{\Delta \boldsymbol{x}} \sum_{i \in \mathcal{R}} \frac{1}{2} \Delta \boldsymbol{x}_i^\top B_i^k \Delta \boldsymbol{x}_i + \nabla f_i(\boldsymbol{x}_i^k)^\top \Delta \boldsymbol{x}_i &+ \boldsymbol{\lambda}^{k\top} \left( \sum_{i \in \mathcal{R}} A_i (\boldsymbol{x}_i^k + \Delta \boldsymbol{x}_i) - b \right) \\ &+ \frac{\rho}{2} \Big\| \sum_{i \in \mathcal{R}} A_i (\boldsymbol{x}_i^k + \Delta \boldsymbol{x}_i) - b \Big\|_2^2 \\ \text{subject to} \quad \tilde{g}_i (\boldsymbol{x}_i^k) + \nabla \tilde{g}_i (\boldsymbol{x}_i^k) \Delta \boldsymbol{x}_i = \boldsymbol{0}, & \qquad \forall i \in \mathcal{R}, \end{split} \tag{2.23}$$

where we combine the equality constraints and the active inequality constraints in $\tilde{g}_i(x_i)^\top = \left( g_i(x_i)^\top,\ \left(h_i(x_i)^\top\right)_{\mathcal{A}(x_i)} \right)$. Here, the matrix $B_i^k$ is a positive definite approximation of the Hessian of the full Lagrangian, i.e. $B_i^k \approx \nabla^2_{x_i} \left( f_i(x_i^k) + \gamma_i^{k\top} g_i(x_i^k) + \mu_i^{k\top} h_i(x_i^k) \right)$, where $\gamma_i$ and $\mu_i$ denote the local equality and inequality multipliers. For the multiplier update we apply the standard dual ascent step from the method of multipliers (2.11),

$$
\lambda^{k+1} = \lambda^k + \alpha^k \left( \sum_{i \in \mathcal{R}} A_i \boldsymbol{x}_i^{k+1} - b \right).
$$

In principle, one could apply this algorithm to equality-constrained problems. This would yield a very effective algorithm: if $\rho$ is large enough, the QP (2.23) becomes strongly convex, and thus solving it can be replaced by solving its KKT conditions (2.2), which form a linear system of equations. However, if inequality constraints are present, the question arises how to obtain the active set $\mathcal{A}(x^k)$. An alternative is to consider inequality constraints in (2.23), but this would make (2.23) substantially more difficult to solve, since the KKT conditions (2.2) then also entail inequality constraints.

ALADIN uses a different approach: it introduces a local NLP step very similar to step 1) of ADMM (Algorithm 4). This step reads

$$\min_{\boldsymbol{x}} \sum_{i \in \mathcal{R}} \tilde{f}_i(\boldsymbol{x}_i) + \boldsymbol{\lambda}^{k \top} A_i \boldsymbol{x}_i + \frac{\nu}{2} \|\boldsymbol{x}_i - \boldsymbol{z}_i^k\|_{\Sigma_i}^2,\tag{2.24}$$

where $\Sigma\_i$ is a (usually diagonal) positive definite scaling matrix and where we introduce auxiliary variables $\mathbf{z}\_i$ serving as a second iterate in ALADIN. As an active set for (2.23), one can use the active set obtained from the minimization of problem (2.24). This has the additional advantage that, if derivative-based numerical solvers are used for (2.24), their derivatives can be reused for setting up (2.23). Note that (2.24) is separable, i.e. the minimization can be parallelized. Combining the above yields the ALADIN algorithm summarized in Algorithm 5. Note that $\tilde{g}(\mathbf{x}^k) = \mathbf{0}$ in Algorithm 5 as feasibility is ensured in step 1), and that $\nabla \tilde{g}$ is the block-diagonal concatenation of all $\nabla \tilde{g}\_i$.

**Algorithm 5** (Basic ALADIN)
Initialization: $\mathbf{z}\_i^0$, $\boldsymbol{\lambda}^0$, $\Sigma\_i \succ 0$ for all $i \in \mathcal{R}$, $\rho$, $\nu$. Repeat:

1) Solve for all $i \in \mathcal{R}$

$$\mathbf{x}\_i^k = \underset{\mathbf{x}\_i \in \mathcal{X}\_i}{\operatorname{argmin}} \ f\_i(\mathbf{x}\_i) + \boldsymbol{\lambda}^{k \top} A\_i \mathbf{x}\_i + \frac{\nu}{2} \|\mathbf{x}\_i - \mathbf{z}\_i^k\|\_{\Sigma\_i}^2.\tag{parallel}$$


2) Solve the coordination QP

$$\begin{split} \Delta \mathbf{x}^{k} &= \underset{\Delta \mathbf{x}}{\operatorname{argmin}} \sum\_{i \in \mathcal{R}} \frac{1}{2} \Delta \mathbf{x}\_{i}^{\top} B\_{i}^{k} \Delta \mathbf{x}\_{i} + \nabla f\_{i}^{\top} (\mathbf{x}\_{i}^{k}) \Delta \mathbf{x}\_{i} \\ &+ \boldsymbol{\lambda}^{k \top} \left( \sum\_{i \in \mathcal{R}} A\_{i} (\mathbf{x}\_{i}^{k} + \Delta \mathbf{x}\_{i}) - b \right) + \frac{\rho}{2} \left\| \sum\_{i \in \mathcal{R}} A\_{i} (\mathbf{x}\_{i}^{k} + \Delta \mathbf{x}\_{i}) - b \right\|\_{2}^{2} \\ \text{subject to} \quad \nabla \tilde{g} (\mathbf{x}^{k}) \Delta \mathbf{x} &= \mathbf{0}. \end{split}$$

3) Set $\mathbf{z}\_{i}^{k+1} = \mathbf{x}\_{i}^{k} + \Delta \mathbf{x}\_{i}^{k}$ (parallel) and $\boldsymbol{\lambda}^{k+1} = \boldsymbol{\lambda}^{k} + \rho \left( \sum\_{i \in \mathcal{R}} A\_{i} (\mathbf{x}\_{i}^{k} + \Delta \mathbf{x}\_{i}^{k}) - b \right)$.
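To make the steps concrete, the following sketch runs a full-step ALADIN loop on a hypothetical two-agent toy problem with scalar decision variables, exact Hessians $B_i$, $\Sigma_i = 1$ and no local constraints, so that the coordination QP reduces to one linear solve. The data and parameter values are chosen purely for illustration:

```python
import numpy as np

# Toy problem: min sum_i 0.5*q_i*(x_i - c_i)^2  s.t.  x_1 - x_2 = 0.
q = np.array([1.0, 3.0])
c = np.array([4.0, 0.0])
A = np.array([1.0, -1.0])        # consensus blocks A_i stacked as a vector
b = 0.0
rho, nu = 10.0, 10.0

z, lam = np.zeros(2), 0.0
for k in range(50):
    # Step 1 (parallel): local augmented problems, closed form for quadratics
    x = (q * c - lam * A + nu * z) / (q + nu)
    grad, B = q * (x - c), q     # exact gradients and Hessians B_i
    # Step 2: coordination QP; unconstrained here, so stationarity is linear
    r = A @ x - b
    H = np.diag(B) + rho * np.outer(A, A)
    dx = np.linalg.solve(H, -(grad + (lam + rho * r) * A))
    # Step 3: full-step updates
    z = x + dx
    lam = lam + rho * (A @ z - b)

print(z, lam)                    # z -> [1, 1] (the centralized optimum), lam -> 3
```

For this quadratic instance the QP step recovers the minimizer of the augmented Lagrangian exactly, so the dual update converges linearly; the centralized solution is $x_1 = x_2 = (q_1 c_1 + q_2 c_2)/(q_1 + q_2) = 1$.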

*Remark 5 (Slack variables for numerical stability)* In [HFD16], the QP step (2.23) is replaced by the equivalent formulation

$$\begin{aligned} \min\_{\boldsymbol{\Delta x}, \boldsymbol{d}} & \sum\_{\boldsymbol{i} \in \mathcal{R}} \frac{1}{2} \Delta \boldsymbol{x}\_{\boldsymbol{i}}^{\top} B\_{\boldsymbol{i}}^{k} \Delta \boldsymbol{x}\_{\boldsymbol{i}} + \nabla f\_{\boldsymbol{i}}^{\top} (\boldsymbol{x}\_{\boldsymbol{i}}^{k}) \Delta \boldsymbol{x}\_{\boldsymbol{i}} + \boldsymbol{\lambda}^{k\top} \boldsymbol{d} + \frac{\mu}{2} \|\boldsymbol{d}\|\_{2}^{2} \\ \text{subject to} & \quad \tilde{\boldsymbol{g}}\_{\boldsymbol{i}} (\boldsymbol{x}\_{\boldsymbol{i}}^{k}) + \nabla \tilde{g}\_{\boldsymbol{i}} (\boldsymbol{x}\_{\boldsymbol{i}}^{k}) \Delta \boldsymbol{x}\_{\boldsymbol{i}} = \boldsymbol{0} \qquad \forall \boldsymbol{i} \in \mathcal{R}, \\ & \quad \sum\_{\boldsymbol{i} \in \mathcal{R}} A\_{\boldsymbol{i}} (\boldsymbol{x}\_{\boldsymbol{i}}^{k} + \Delta \boldsymbol{x}\_{\boldsymbol{i}}) - \boldsymbol{b} = \boldsymbol{d} \qquad \mid \boldsymbol{\lambda}^{\text{QP}}, \end{aligned} \tag{2.25}$$

where $\mathbf{d}$ are slack variables. These slack variables are important for the numerical stability of ALADIN implementations. For the understanding of ALADIN this formulation does not add much, but due to its practical importance we will consider it in the following.

*Remark 6 (Computation of $\boldsymbol{\lambda}^{k+1}$)* Note that $\boldsymbol{\lambda}^{k+1}$ coincides with $\boldsymbol{\lambda}^{\text{QP}}$ from (2.25). This can be seen from the optimality conditions of (2.25): $\boldsymbol{\lambda}^{\text{QP}} = \boldsymbol{\lambda}^k + \mu \mathbf{d} = \boldsymbol{\lambda}^k + \mu \left( \sum\_{i \in \mathcal{R}} A\_i (\mathbf{x}\_i^k + \Delta \mathbf{x}\_i^k) - b \right)$.

*Remark 7 (Globalization of ALADIN)* Note that for global convergence, i.e. convergence from remote starting points, a globalization routine is necessary. This means that step 3) of ALADIN is replaced by

$$z\_i^{k+1} = z\_i^k + \alpha\_1 (\mathbf{x}\_i^k - z\_i^k) + \alpha\_2 \Delta \mathbf{x}\_i^k, \quad \lambda^{k+1} = \lambda^k + \alpha\_3 (\lambda^{\text{QP}} - \lambda^k),$$

where the step-sizes $\alpha\_1$, $\alpha\_2$ and $\alpha\_3$ are determined by these routines. The setting $\alpha\_1 = \alpha\_2 = \alpha\_3 = 1$ recovers step 3) of Algorithm 5, which is called the full-step variant of ALADIN. One globalization routine is given in [HFD16]; however, it is highly centralized and computationally intensive. In the present work we focus on local convergence, i.e. on convergence for initial guesses close enough to a local minimizer, where a globalization routine can be omitted.

## **2.3 What's wrong with constraints?**

In this subsection we briefly comment on difficulties which can arise when applying distributed optimization algorithms to *constrained* problems. Constraints are important in many cases—particularly for the applications in power systems and control which we have in mind.

## **2.3.1 Alternating projections in ADMM and ALADIN**

Constrained problems might lead to difficulties in alternating direction methods such as ADMM. As outlined before, the core idea of ADMM is to minimize the augmented Lagrangian $\mathcal{L}^\rho$ with respect to two distinct variable blocks $\mathbf{x}$ and $\mathbf{z}$ in an *alternating fashion* to obtain separability. As the name "alternating direction" already suggests, ADMM is a kind of alternating projection method between the two types of constraints—local constraints and consensus constraints. This can be seen from the two different constraint sets $\mathcal{X}$ and $\mathcal{C}$ in the minimization steps of Algorithm 4. These alternating projections might lead to very slow convergence of ADMM, as we will see next.

### **An example**

Let us consider a convex feasibility problem (i.e. $f \equiv 0$) with a local constraint set $\mathcal{X}$,

$$\min\_{\mathbf{x}} \iota\_{\mathcal{X}}(\mathbf{x}) \qquad \text{subject to} \qquad A\mathbf{x} = b. \tag{2.26}$$

Then ADMM (Algorithm 3) simplifies to the following alternating projection procedure

$$\begin{aligned} \boldsymbol{x}^{k+1} &= \boldsymbol{\Pi}\_{\mathcal{X}} (\boldsymbol{z}^k - \boldsymbol{\mu}^k) \\ \boldsymbol{z}^{k+1} &= \boldsymbol{\Pi}\_{\mathcal{C}} (\boldsymbol{x}^{k+1} + \boldsymbol{\mu}^k) \\ \boldsymbol{\mu}^{k+1} &= \boldsymbol{\mu}^k + \boldsymbol{x}^{k+1} - \boldsymbol{z}^{k+1}, \end{aligned}$$

where $\Pi\_{\mathcal{C}} : \mathbb{R}^n \to \mathcal{C}$ denotes the orthogonal projection onto the consensus constraint set $\mathcal{C} = \{ \mathbf{x} \in \mathbb{R}^n \mid A\mathbf{x} = b \}$, see [Boy+11, Sec 3.1.1]. Let us consider the 2-dimensional example

$$\mathcal{C} = \{ \mathbf{x} \in \mathbb{R}^2 \mid \mathbf{x}\_2 = 0 \} \quad \text{and} \quad \mathcal{X} = \{ \mathbf{x} \in \mathbb{R}^2 \mid \|\mathbf{x} - \bar{\mathbf{x}}\|\_2^2 \le r^2 \},$$

with $\bar{\mathbf{x}} = (10,\ 50.99)^\top$ and $r = 51$. The set of optimal solutions is given by $\mathcal{X}^\star = \mathcal{X} \cap \mathcal{C} = \{ \mathbf{x} \in \mathbb{R}^2 \mid \mathbf{x} = (a,\ 0)^\top,\ a \in [8.99,\ 11.01] \}$.

Figure 2.4 depicts the two feasible sets $\mathcal{X}$ and $\mathcal{C}$ considered in step 1) and step 2) of ADMM with the corresponding iterates starting from the initial point $\mathbf{z}^0 = (0,\ 0.5)^\top$ and $\boldsymbol{\mu}^0 = (0,\ 0)^\top$. One can clearly see the alternating projections in both algorithms: the iterates $\{\mathbf{x}^k\}$ stay feasible in $\mathcal{X}$ and the iterates $\{\mathbf{z}^k\}$ stay feasible in $\mathcal{C}$—thus "alternating" between these two sets.<sup>11</sup>
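The projection form of ADMM for this example takes only a few lines; the following is a minimal sketch (iteration counts merely indicative) which also records the size of the scaled dual variable, foreshadowing the wind-up effect discussed in Section 2.3.2:

```python
import numpy as np

# ADMM as alternating projections for the example above:
# X = disk around xbar with radius r, C = {x : x_2 = 0}.
xbar, r = np.array([10.0, 50.99]), 51.0

def proj_X(v):
    """Orthogonal projection onto the disk ||v - xbar||_2 <= r."""
    dv = v - xbar
    n = np.linalg.norm(dv)
    return v if n <= r else xbar + r * dv / n

def proj_C(v):
    """Orthogonal projection onto {x : x_2 = 0}."""
    return np.array([v[0], 0.0])

z, mu = np.array([0.0, 0.5]), np.zeros(2)   # z^0 and scaled dual mu^0
for k in range(3000):
    x = proj_X(z - mu)
    z = proj_C(x + mu)
    mu = mu + x - z
    if k in (100, 700, 2999):
        print(k, abs(x[1]), np.linalg.norm(mu))  # infeasibility |x_2| and dual size
```

The printed infeasibility stays large for hundreds of iterations while the dual variable slowly unwinds, matching the behavior reported in Figures 2.5 and 2.6.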

<sup>11</sup> For ALADIN, the iterates $\{\mathbf{z}^k\}$ slightly deviate from $\mathcal{C}$ as the consensus constraints (2.12d) are considered in the penalization term of $\mathcal{L}^\rho$.

Figure 2.4: ALADIN and ADMM iterates for problem (2.26).

For ADMM, convergence is slow since pure projection between both sets makes almost no progress toward their intersection. For ALADIN, the situation is different since here also curvature information of the constraint set $\mathcal{X}$ (i.e. the constraint linearizations $\nabla \tilde{g}(\mathbf{x}^k)$ in step 2) of Algorithm 5) is considered, driving the iterates much faster to $\mathcal{X} \cap \mathcal{C}$. This can be seen from the ALADIN step directions, which are almost tangential to the boundary of $\mathcal{X}$.

*Remark 8 (Description of feasible sets)* Note that considering constraint linearizations in ALADIN comes at the cost of requiring the feasible set $\mathcal{X}$ to be described by differentiable constraint functions $g$ and $h$, i.e. $\mathcal{X} = \{ \mathbf{x} \in \mathbb{R}^n \mid g(\mathbf{x}) = 0,\ h(\mathbf{x}) \le 0 \}$ as in (2.1b). This can be seen as a restriction on the class of distributed problems to which ALADIN is applicable. ADMM, on the other hand, does not have such a restriction and is thus able to handle general convex constraint sets. In view of the fact that the majority of practical applications *has* constraint sets described by differentiable functions, not considering this information means accepting unnecessarily poor performance.

Figure 2.5: Infeasibility in C over the iteration index.

## **2.3.2 Integrator wind-up in ADMM**

Apart from slow convergence due to alternating projections, there seems to be a second effect slowing down the performance of ADMM. This effect is related to the multiplier update in step 3) of ADMM (Algorithm 4) acting as an integrator.

Figure 2.5 shows the infeasibility of the iterates $\{\mathbf{x}^k\}$ with respect to the constraint set $\mathcal{C}$, i.e. $|\mathbf{x}\_2^k|$, over the iteration index $k$. Not only does ALADIN converge much faster than ADMM in terms of total iterations, but the accuracy in the constraint violation one can reach with ADMM also seems to be limited to a level of around $10^{-2}$. This is surprising, as ADMM is shown to converge monotonically [HY15].

Figure 2.6 also shows the infeasibility of the iterates $\{\mathbf{x}^k\}$ with respect to the constraint set $\mathcal{C}$, but for a much larger number of iterations, and additionally the scaled dual variables $\boldsymbol{\mu}^k$. Here, one can observe a very surprising effect: whereas in the primal space there is no visible progress for more than 600 iterations, the dual variables change slightly. After about 700 iterations, the primal variables suddenly converge, and to a very high accuracy.

A look at the dual variables reveals why: they integrate to a value of about 6 in the first few iterations and then converge only very slowly to their optimal value of zero, which seems to hinder the primal variables from converging.

One might ask why this is the case. From Figure 2.4 one can see that ADMM performs quite large alternating projection steps, which drive the dual variables to a high value in the first iterations, since the multiplier update rule "integrates" the distance $(\mathbf{x}^{k+1} - \mathbf{z}^{k+1})$. Once the iterates have reached the feasible set, the

Figure 2.6: Infeasibility in C and scaled dual variables.

alternating behavior continues although the iterates are already feasible. As the distance $(\mathbf{x}^{k+1} - \mathbf{z}^{k+1})$ is quite small in this new region of oscillation (Figure 2.4, second plot), it takes several hundred iterations to unwind the previously integrated value of the multipliers. Once they have reached their optimal value, ADMM suddenly converges. Observe that for this example the behavior is independent of $\rho$, since $\rho$ does not appear in the ADMM iterations. Thus, this effect can *not* be overcome by tuning $\rho$.

To the best of our knowledge, this effect has not been investigated in detail in the literature so far. Moreover, it seems to be one reason for the very slow convergence of ADMM on some OPF problems (cf. Section 5.4.1).

## **2.3.3 Non-convex constraints**

Many distributed algorithms require the feasible set to be convex. If one nonetheless applies these algorithms to problems with non-convex constraints, convergence is not guaranteed. Thus, the question arises whether there is a particular difficulty for *distributed* algorithms in handling non-convex constraints. Next, we will show that this is not the case. More specifically, we will give an example where *all algorithms considered here diverge*—*including centralized methods* such as the method of multipliers and SQP—contradicting the common wisdom that they can achieve convergence to a local minimizer from an *arbitrary* initialization when equipped with suitable globalization routines.

With this, we would like to emphasize that there is an *inherent difficulty* associated with certain kinds of non-convex constraints from which all algorithms suffer—not only the distributed ones. This should shape our expectation that it is hardly possible to design distributed algorithms able to handle general non-convex constraints. The best we can achieve here, too, is local convergence—i.e. convergence from an initial point close enough to a local minimizer.<sup>12</sup>

## **What's wrong with non-convex constraints?**

Consider the non-convex problem

$$\min\_{\mathbf{x}\in\mathbb{R}^2} 10^{-2} \left( (\mathbf{x}\_1 - 8)^2 + (\mathbf{x}\_2 - 1)^2 \right) \tag{2.27a}$$

$$\text{subject to} \qquad \mathbf{x}\_2 - \sin(\mathbf{x}\_1) = 0 \tag{2.27b}$$

$$(0.15,\ -1)\,\mathbf{x} = 0.\tag{2.27c}$$

This problem can be written as

$$\min\_{\mathbf{x}} \quad 10^{-2} \left( (\mathbf{x}\_1 - 8)^2 + (\mathbf{x}\_2 - 1)^2 \right) + \iota\_{\mathcal{X}\_1}(\mathbf{x}) + \iota\_{\mathcal{X}\_2}(\mathbf{x}) \tag{2.28}$$

with the non-convex constraint set $\mathcal{X}\_1 = \{ \mathbf{x} \in \mathbb{R}^2 \mid \mathbf{x}\_2 - \sin(\mathbf{x}\_1) = 0 \}$ and the constraint set $\mathcal{X}\_2 = \{ \mathbf{x} \in \mathbb{R}^2 \mid (0.15,\ -1)\,\mathbf{x} = 0 \}$. The contour lines of the objective and the feasible set are depicted in Figure 2.7. The feasible set consists of the three points $\mathcal{X} = \{(0,\ 0)^\top,\ \pm(2.72,\ 0.41)^\top\}$, one of which ($\mathbf{x}^\star = (2.72,\ 0.41)^\top$) is globally optimal.

<sup>12</sup> One exception are centralized global optimization algorithms such as branch-and-bound [Ste17]. However, these global optimization techniques are typically very expensive and hence often not applicable to large problems.

Figure 2.7: Contour plots and feasible set of problem (2.27).

Three of the four considered algorithms—ADMM, ALADIN and the method of multipliers—are based on minimizing the augmented Lagrangian, which for $\boldsymbol{\lambda} = 0$ reads

$$\mathcal{L}^{\rho}(\mathbf{x},0) = 10^{-2} \left( (\mathbf{x}\_1 - 8)^2 + (\mathbf{x}\_2 - 1)^2 \right) + \iota\_{\mathcal{X}\_1}(\mathbf{x}) + \frac{\rho}{2} \left\| (0.15,\ -1)\,\mathbf{x} \right\|\_2^2.$$

The theory of augmented Lagrangian-based methods predicts that if we iteratively minimize $\mathcal{L}^\rho$ for a fixed multiplier and let $\rho \to \infty$, then these methods converge to the globally optimal solution to (2.27) regardless of the update rule for $\boldsymbol{\lambda}$ [Ber99, Prop. 4.2.1]. At first glance this sounds very promising, but let us have a closer look at $\mathcal{L}^\rho$ when increasing $\rho$. Figure 2.8 shows the contour lines and the feasible set of the problem

$$\min\_{\mathbf{x}} \mathcal{L}^{\rho}(\mathbf{x}, 0) \tag{2.29}$$

for $\rho = 10^8$. One can clearly see that $\mathcal{L}^\rho(\mathbf{x}, 0)$ has two local minima in the plotted range: one at the globally optimal solution to problem (2.27), $(2.72,\ 0.41)^\top$, but also another one at $(7.7,\ 0.98)^\top$, which is not even a feasible point. This highlights an inherent practical difficulty of augmented Lagrangian methods for non-convex constraints: if $\mathcal{L}^\rho$ is iteratively minimized with *local solvers* (which is often the case in practice), then the method might converge to a point

Figure 2.8: Feasible set and contours of (2.29) for $\rho = 10^8$.

which is neither feasible nor optimal.<sup>13</sup> The premise of the aforementioned theorem is that $\mathcal{L}^\rho$ is minimized *globally*, which cannot be guaranteed here when using local solvers. Consequently, this behavior can be observed in ADMM, ALADIN and the method of multipliers whenever their subproblems are solved with local solvers.
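The two minima can be verified numerically: restricted to the manifold $\mathcal{X}_1$ (parameterized by $t = x_1$, $x_2 = \sin(t)$), $\mathcal{L}^\rho(\cdot, 0)$ becomes a function of one variable which can be scanned on a fine grid. A minimal sketch, with the grid range chosen to cover the region plotted in Figure 2.8:

```python
import numpy as np

# L^rho(x, 0) restricted to X_1 = {x_2 = sin(x_1)}, parameterized by t = x_1.
rho = 1e8
t = np.linspace(1.0, 9.0, 160001)
phi = 1e-2 * ((t - 8.0)**2 + (np.sin(t) - 1.0)**2) \
      + rho / 2.0 * (0.15 * t - np.sin(t))**2

# indices of interior local minima of the sampled function
loc = np.where((phi[1:-1] < phi[:-2]) & (phi[1:-1] < phi[2:]))[0] + 1
print(t[loc])  # two minimizers: near t = 2.72 (global solution) and t = 7.70 (infeasible)
```

The spurious minimizer near $t = 7.70$ sits where the penalty gradient vanishes because $\cos(t) \approx 0.15$, i.e. where the two constraint gradients become nearly parallel, even though the penalty itself is far from zero there.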

### **Numerical behavior**

Figure 2.9 shows the iterates of ADMM, ALADIN and the method of multipliers starting from the initial point $\mathbf{x}^0 = (5,\ -0.5)^\top$ and $\boldsymbol{\lambda}^0 = 0$. One can observe the convergence of all these methods to the infeasible point $(7.7,\ 0.98)^\top$ and, similar to the previous subsection, the "alternating behavior" of ALADIN and ADMM projecting between the two constraint sets defined by (2.27b) and (2.27c).

Moreover, after several iterations close to the infeasible point, SQP produces iterates far from any feasible point (indicated by the blue arrow in Figure 2.9). The reason is that the gradients of the constraints (2.27b) and (2.27c) become very close to linearly dependent at $(7.7,\ 0.98)^\top$, so the intersection points of the corresponding constraint linearizations lie very far away from the previous iterate. Globalization routines

<sup>13</sup> This effect is even strengthened if the local solvers are initialized at the previous iterate which is also often done in practice for computational efficiency reasons.

Figure 2.9: Iterates of ADMM, ALADIN, the method of multipliers, and SQP for problem (2.27).

do not help here either. Most global convergence proofs for SQP require either an a priori boundedness assumption on the iterates $\{\mathbf{x}^k\}$, $\{\boldsymbol{\lambda}^k\}$ [Sol09], or linearly independent constraint linearizations [BT95, Thm 4.1], both of which are violated in this case. Even if these assumptions hold, convergence of SQP methods is typically guaranteed only to a *stationary point* of some merit function such as $\mathcal{L}^\rho$ [BT95, Thm 4.2]. However, these stationary points do not necessarily correspond to a local minimizer, as we observed in our example.

Note that this effect cannot occur for convex problems. In this case, all subproblems are also convex and solvers necessarily return global optima. In summary, we can conclude the following: non-convex constraints are particularly difficult to handle—not only for distributed but also for centralized algorithms. Thus, designing distributed algorithms with convergence guarantees from *arbitrary* initial points might be out of reach. However, in practical problems we often have initial guesses close enough to local minimizers, making these algorithms nonetheless very successful in many domains. A similar class of counterexamples for interior-point methods can be found in [WB00].

# **3 A Survey on Distributed Optimization**

In this chapter, we provide an overview of the state of the art in distributed optimization algorithms. Moreover, we investigate what makes these algorithms hard to apply in many practical settings. The literature on distributed optimization grows very fast and this overview is far from complete. However, we have tried to identify the main lines of research and the most important works within each of them.

We categorize the research on distributed optimization along three main lines: primal-dual methods, primal methods, and internal decomposition methods.


Whereas primal-dual methods are mainly based on Lagrangian relaxation (cf. Section 2.2), primal methods work purely in primal space, i.e. they do typically *not* update any form of Lagrange multipliers. The third line of research considers internal decomposition methods. These methods decompose operations of standard nonlinear programming algorithms such as solving a linear system of equations. We begin with primal-dual methods as one of the earliest approaches on distributed optimization.

# **3.1 Primal-dual algorithms**

Early works on distributed optimization trace back to Everett, Dantzig, Wolfe and Benders [Eve63; DW60; Ben62] in the 1960s. These first works mainly


Figure 3.1: Timeline of distributed optimization algorithms.

considered Lagrangian relaxation for strictly convex problems and decomposition methods for linear programs. Later, Lagrangian relaxation was combined with the augmented Lagrangian techniques developed mainly by Hestenes, Powell and Miele [Pow69; Hes69; Mie+71; Mie+72] to improve numerical stability and to provide guarantees also for convex but not strictly convex problems. This led to first versions of ADMM with improved convergence guarantees and improved practical convergence [GM76; GM75; EB92]. Many of these duality-based works are summarized in the excellent textbooks by Bertsekas and Tsitsiklis [BT89] and Censor and Zenios [CZ97]. A timeline of distributed optimization approaches is shown in Figure 3.1.

### **Renewed interest in the late 2000s**

Distributed optimization gained new interest in the late 2000s, mainly in the fields of machine learning and imaging science, where ADMM outperformed state-of-the-art methods in certain applications [ABF10; SMG10; GM12]. Moreover, new applications in signal recovery emerged [CP07; CP11]. The main motivation here was computational speedup, i.e. to find methods suitable for parallel computing.<sup>1</sup> Moreover, state-of-the-art methods have been shown to be special cases of ADMM [GO09; Gol+14], allowing their treatment in a unified framework.

Duality-based optimization methods were used in communication networks [Chi+07] beginning already in the late 1990s [KMT98; LL99]. Similar approaches were used in wireless sensor networks [MBG10; SRG08], in signal

<sup>1</sup> A nice example of this need is given in [BMN01]. There, the authors consider an image-reconstruction problem from positron emission tomography. Even one evaluation of the gradient of the objective took 15 to 45 min, so there was no chance of using Newton-based schemes such as IP or SQP methods.

processing [ZGC09] and in certain fields of machine learning (support-vector machines) [FCG10]. A different version of dual decomposition, called the *proximal center method*, is established in [NS08]. Herein—instead of minimizing the augmented Lagrangian in an alternating fashion to achieve separability—the author adds two linear proximal terms which are separable and lead to a differentiable dual function. An early dual-decomposition-based approach for a multi-agent setting was presented in [TTM11]. The highly influential paper [Boy+11] showed that many of the above works can be treated in a unified framework based on ADMM. The importance of this framework lies in its generality—the convergence analysis of [Boy+11] carries over to the above applications. An even more general convergence analysis of decomposition methods (including ADMM) based on augmented Lagrangians is given in [ST14]. A dual method combining smoothing with an excessive gap technique for convex problems is presented in [TDSD13]. The numerical results in this work are promising and the method has a main advantage over other primal-dual methods: the parameters are updated automatically. Using an accelerated scheme in the dual step yields a convergence rate of $\mathcal{O}(1/k^2)$ for general convex problems. A generalization is given in [TND16], including techniques from [Nes05a; Nes05b] allowing for inexact solutions of the subproblems and providing automatic parameter updates.

The works starting with [WO12] consider a network setting, i.e. they introduce consensus constraints via characteristic matrices of the underlying communication graph (incidence matrix, Laplacian matrix, cf. Appendix A.2). [Shi+14] considers a problem formulation in the form of (A.10). This leads to an *edge-based* formulation (ADMM maintains two Lagrange multipliers per edge). [MO17] develops a *node-based* formulation, where one Lagrange multiplier vector is maintained for each node in the network and the network constraint is encoded in a Laplacian formulation similar to (A.12). Both cases lead to sublinear/linear convergence under convexity/strong convexity assumptions. The very recent work [Mao+20] presents a decentralized method for unconstrained non-convex consensus optimization with guarantees in a network setting.

## **Recent works considering non-convex objectives**

Recently, primal-dual algorithms were successfully extended, with guarantees, to problems with non-convex objective functions but convex constraint sets. An ADMM-like algorithm, where the augmented Lagrangian is minimized in a block-coordinate descent scheme, is presented in [HJ14] based on the convergence analysis of [ABS13]. The analysis relies on concepts beyond convexity such as the Kurdyka-Łojasiewicz (KL) property and Lipschitz/regularity assumptions. The KL property is fulfilled, for example, if the objective and the constraint functions are semi-algebraic. This line of analysis is combined in [HJ16] with generalized equations from [ZA10] and applied to a distributed control setting. The KL framework is developed further in [BST18; ST19a]. Approaches applying ADMM in the field of optimal control can be found in [KCD15; RCG17; BG19]. A dual second-order method decomposing strictly convex QPs from optimal control along the time axis is presented in [FSD15].

The works [Cha+16b; Cha+16a] show that ADMM with standard consensus constraints and strongly convex objectives converges linearly even under asynchronous operation with bounded communication delay. [CDZ15] and [CZ17] present an ADMM-flavoured method providing convergence guarantees also for non-convex objectives under strong second-order sufficient conditions, but with convex constraint sets. Moreover, [HLR16; LP15] analyze the convergence properties of ADMM for non-convex problems including certain classes of non-convex constraints. A more general convergence analysis is given in [WYZ19], including [HLR16; LP15] as special cases. The line of analysis of [WYZ19] is extended in [GGC20] to problems where the consensus constraint (2.12d) can be a multi-affine mapping.

*Remark 9 (Relation to operator splitting methods)* There are excellent works generalizing primal-dual distributed algorithms in the framework of *operator splitting* and *proximal* methods [CP11; Sta+16; PB14]. This perspective paves the way for a unified convergence analysis. Moreover, it leads to suggestions for effective preconditioning [GB15; GB17]. However, this point of view is beyond the scope of the present work, as it requires a large amount of additional mathematical prerequisites from operator splitting theory. We refer to [BC11] and the above works for further details.


Table 3.1: Typical properties of primal-dual decentralized algorithms.

## **Pros and cons of primal-dual methods**

Primal-dual methods are very successfully applied in many contexts—especially for convex problems. For certain applications—e.g. in embedded optimal control, signal recovery and machine learning—they lead to promising performance [Ste+20; CP07]. However, most of the primal-dual approaches suffer from two drawbacks:


We refer to [DY16; GB16; PSB14] and [HY12] for a detailed convergence analysis.

Recent overviews on decentralized optimization considering primal and dual approaches can be found in [NOR18; NOW18; Boy+11; Yan+19; HM11]. Examples on the application of ADMM to machine learning problems can be found in [Tay+16; Sul+19]. Recent surveys on optimization methods for machine learning including primal and dual methods can be found in [Ber15b; BCN18; SNW12]. Overviews on splitting methods for control with emphasis on time-wise decomposition are given in [Sta+16; Fer+17].

Typical properties of primal-dual algorithms are summarized in Table 3.1.

<sup>2</sup> By setting $\rho = 0$ one obtains dual decomposition.

## **3.2 Primal algorithms**

The second main line of research are primal first-order methods. These methods are usually decentralized variants of the gradient/subgradient or proximal gradient method, producing iterates purely *in the primal space*. Hence, they usually do not maintain any form of Lagrange multipliers. We give a brief overview of the latest developments in a network setting. For a recent and comprehensive analysis of their centralized counterparts we refer to [Bec17; Ber15a].

#### **Classical approaches**

In contrast to dual methods, primal distributed methods are distributed *in the design* of the optimization algorithms. One of the most general settings is given in [TBA86], which is also one of the earliest and most influential papers on primal methods in distributed optimization. To illustrate the idea of primal methods and to be able to classify the relevant literature, we give a simplified version of the iteration scheme typically used in primal methods (assuming deterministic and synchronous operation, no communication delays and one-dimensional local decision variables). This (simplified) iteration reads

$$\mathbf{x}^{k+1} = W^k \mathbf{x}^k - \boldsymbol{\Lambda}^k \mathbf{s}^k. \tag{3.1}$$

Here, $W^k$ is a so-called *mixing matrix*, $\mathbf{s}^k$ encodes some form of gradient/subgradient or Newton step with respect to the individual objective functions $f\_i$, and $\Lambda^k$ is a diagonal matrix providing positive stepsizes for each $\mathbf{x}\_i$. The matrix $W^k$ can be seen as a kind of averaging step between neighbors, driving the individual $\mathbf{x}\_i$ toward a consensus, while the vector $\mathbf{s}^k$ contains some desirable direction of descent in the objective function, driving the iteration towards optimality. The connectivity of the network is encoded in the matrix $W^k$ (which might be time-varying).<sup>3</sup> Convergence of these methods is guaranteed in a very general setting including asynchronous operation and communication delays [Ber83; TBA86]. An example of an early application of this framework to distributed reinforcement learning can be found in [Tsi94].

<sup>3</sup> The matrix $W^k$ has to fulfill certain assumptions, e.g. it has to be *stochastic* or *doubly stochastic*, meaning that all entries have to be non-negative and the column sums of $W^k$ have to be one.

#### **Consensus and first-generation algorithms**

In the 2000s, so-called *consensus algorithms* gained significant interest [JLM03; Boy+05; OFM07; OT09; Blo+05].<sup>4</sup> Consensus algorithms compute (roughly speaking) weighted averages between neighboring agents sequentially until a certain level of consensus (i.e. level of "equalness" in the primal variables) is reached. Thus, the iteration of consensus algorithms is $\mathbf{x}^{k+1} = W^k \mathbf{x}^k$. By using tools from classical discrete-time control theory, one can accelerate convergence of consensus algorithms by choosing $W^k$ in a certain manner. From this perspective it seems natural that consensus algorithms are closely related to the before-mentioned primal algorithms. The work [BT07] showed that some of them are a special case of the earlier work [TBA86], where, roughly speaking, the vector $\mathbf{s}^k$ in the primal optimization framework (3.1) is zero. However, these developments renewed interest in decentralized primal algorithms and led to the development of decentralized subgradient methods (DSG) for non-smooth objectives (e.g. for $\ell_1$-regularized problems) [NO09]. Thereby the authors exploit that the consensus problem can equivalently be formulated as a *feasibility problem*, i.e. one minimizes a zero function and encodes consensus as a constraint, mathematically described by a constraint set $\mathcal{X} = \{ \mathbf{x} \in \mathbb{R}^n \mid \mathbf{x}\_1 = \mathbf{x}\_2 = \dots = \mathbf{x}\_N \}$ [Ned+14]. The decentralized subgradient method was extended to a stochastic setting with subgradient projection, where all agents have knowledge of a global constraint set, and thus decentralized *constrained* and *non-smooth* optimization also became possible [SRNV10; Ned11].

In [DAW12], DSG was extended to a decentralized proximal gradient method based on [Nes09; DAW10], allowing for constraints in a deterministic, synchronous and undelayed setting. That work also analyzes the influence of the network structure on the convergence. In the same setting, a decentralized primal-dual subgradient method was proposed in [ZM12]. All the above methods require a diminishing step size to converge to an optimal solution—one reason for making

<sup>4</sup> A consensus problem is a problem where multiple entities have to agree on something. Examples are opinion dynamics, where after a while the population agrees on a certain political decision; sensor networks, where multiple measurements have to be aggregated, i.e. there has to be a "consent" on the measured value; formation control, where autonomously flying objects have to agree on a certain formation; or self-organizing biological systems, where swarms of animals (e.g. fish) move in a certain direction.

them slow compared to other methods [NOR18]. To accelerate convergence, primal methods based on Nesterov's accelerated gradient [Nes83] for unconstrained problems over networks have been proposed in [CO12; JXM14]. They exhibit an improved sublinear convergence rate. The work [YLY16] analyzes the convergence of DSG schemes in the case of a fixed stepsize. Therein, the authors show that with a fixed stepsize the accuracy of DSG is limited by the stepsize, i.e. DSG converges to a more exact solution with a smaller stepsize. On the other hand, a smaller stepsize slows down convergence, and hence there is a tradeoff between convergence speed and achievable accuracy, which can be seen as a drawback of DSG methods. This effect is called the *exactness-speed dilemma*. An extension of the analysis of DSG and prox-DSG methods to the non-convex regime is given in [ZY18], showing that most properties from the convex setting are preserved. This work considers new analysis tools suitable for the non-convex case and shows convergence to stationary points (and to limit cycles) under almost the same conditions as in the convex case. Sufficient conditions for optimal convergence in time-varying networks, applicable to the algorithms above, are given in [Rog+19].
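The exactness-speed dilemma can be observed directly in a small instance of iteration (3.1) with a fixed stepsize and smooth local objectives; the three-agent data below are hypothetical and chosen only for illustration:

```python
import numpy as np

# Iteration (3.1) with s^k = grad f(x^k) and fixed stepsize,
# local objectives f_i(x) = 0.5*(x - c_i)^2; minimizer of sum f_i is mean(c) = 3.
c = np.array([0.0, 2.0, 7.0])
W = np.array([[0.50, 0.25, 0.25],            # doubly stochastic mixing matrix
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
alpha = 0.1                                   # fixed stepsize

x = np.zeros(3)
for k in range(500):
    x = W @ x - alpha * (x - c)               # (3.1) with s^k = grad f(x^k)
print(x.mean(), x.max() - x.min())            # average is exact, consensus error stays finite
```

The iterates converge, but to a point whose consensus error is proportional to the stepsize: shrinking `alpha` improves accuracy at the cost of slower convergence, which is exactly the tradeoff described above.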

#### **Second-generation algorithms**

A major drawback of the algorithms above is the need for a diminishing stepsize to guarantee convergence, which typically makes them slow. A new type of decentralized primal method called *Exact First-Order Algorithm (EXTRA)* was developed in [Shi+15b], overcoming this difficulty. The main idea is to introduce a correction term into the decentralized gradient schemes from before, compensating for the non-stationarity of the iterates in the limit. The correction term is a sum of previous iterates—hence, one has to maintain one additional vector while iterating. The update formulas are

$$\mathbf{x}^{k+1} = W^k \mathbf{x}^k - \Lambda^k \mathbf{y}^k,\tag{3.2a}$$

$$\mathbf{y}^{k+1} = W^k \mathbf{y}^k + \nabla f(\mathbf{x}^{k+1}) - \nabla f(\mathbf{x}^k). \tag{3.2b}$$

Here, $\mathbf{y}^k$, updated via the gradient differences in (3.2b), is the newly introduced compensation term. By doing so, one obtains convergence also *for a fixed stepsize*, overcoming the exactness-speed dilemma. EXTRA yields sublinear convergence for convex objective functions and linear convergence for strictly convex objective functions. An extension to constrained problems based on the proximal gradient method is given in [Shi+15a]. Also numerically, EXTRA has been shown to outperform existing DSG schemes. A similar scheme (Aug-DGM) was proposed in [Xu+15], allowing for uncoordinated stepsizes. Algorithms very similar to EXTRA and Aug-DGM have been proposed in [Ned+17; QL18]. In [Ned+17], the developed DIGing algorithm provides convergence guarantees for strongly convex functions over time-varying graphs using new arguments from control theory (small-gain theorem). The authors of [QL18] prove sublinear convergence for convex functions. An extension of [Ned+17] to uncoordinated stepsizes is given in [NOS17]. A Nesterov-accelerated variant (Acc-DNGD) is given in [QL19] with a significantly improved sublinear convergence rate. Asynchronous operation and time-varying directed graphs with a linear convergence proof are considered in [Pu+20], with a similar approach given in [XK18]. [MJ18] presents an algorithm similar to DIGing with improved practical convergence and a substantial reduction in communication. Non-convex objectives are considered in [TT17].
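To make the correction idea concrete, the following sketch implements the gradient-tracking iteration (3.2) on a toy quadratic consensus problem (the problem data, mixing matrix and stepsize are our own illustrative choices): with a fixed stepsize the iterates converge to the exact minimizer.

```python
import numpy as np

# Gradient tracking as in (3.2): y^k tracks the average gradient.
a = np.array([1.0, 4.0, 10.0])        # f_i(x) = 0.5*(x - a_i)^2
grad = lambda x: x - a                # stacked local gradients
W = np.full((3, 3), 1.0 / 3.0)        # doubly stochastic mixing matrix
alpha = 0.1                           # FIXED stepsize

x = np.zeros(3)
y = grad(x)                           # initialize tracker with local gradients
for _ in range(300):
    x_new = W @ x - alpha * y         # (3.2a)
    y = W @ y + grad(x_new) - grad(x) # (3.2b): compensation term
    x = x_new
# x converges to the exact minimizer mean(a), despite the fixed stepsize
```

Comparing with the fixed-stepsize DSG behavior discussed above, the extra tracking vector removes the stepsize-dependent bias entirely.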

A different line of research in distributed primal methods was proposed in [Scu+14]. Therein, the main idea, based on [RHL13], is to use a proximal-gradient-like scheme, where non-convex parts of the objective are linearized and the evaluation of the proximal operator is done in a Jacobi or Gauss-Seidel-like procedure. This leads to convergence to stationary points also in case of non-convex objectives with a convex constraint set. Note that this method also needs a diminishing stepsize. These ideas are combined with a dynamic consensus algorithm from [ZM10], leading to a very effective algorithm called NEXT [DLS16] with an empirically linear rate of convergence and convergence speed similar to ADMM. An extension to arbitrary and time-varying digraphs is given in [SSP16] (SONATA), containing NEXT, Aug-DGM and DIGing as special cases.

#### **Decentralized Newton-methods**

Another main line of research is primal decentralized optimization based on Newton methods. The article [JOZ09] considers a decentralized Newton scheme very similar to the one we will use later for bi-level ALADIN. The consensus constraint is defined via a graph incidence matrix, i.e. each node maintains only one decision variable, and the scheme is unconstrained. The Newton step is computed via a decentralized consensus scheme, and errors in the step computation are included in the convergence analysis. A similar approach is proposed in [TBJ19], where, in contrast to the method before, the dual Hessian is converted into a system of symmetric diagonally dominant (SDD) matrices and this new (lifted) system is solved by a decentralized solver specifically designed for SDD systems. A combination of Newton-based ideas with average consensus is given in [MLR17; Mok+16]; these methods converge quite slowly even for strongly convex problems. The approach in [Var+16] at least achieves a practically linear rate of convergence.

## **Pros and cons of primal methods**

A main advantage of primal methods is that they often come with advanced features such as asynchronous operation, handling of time-delays, or a changing network topology. However, their convergence is typically very slow, although second-generation algorithms made significant progress towards accelerating convergence. Moreover, having optimization problems from power systems or distributed optimal control in mind, current state-of-the-art primal methods have the major drawback that they often do not consider constraints. If constraints are considered, they are typically assumed to be convex. Moreover, knowledge of the global constraint set by all agents is sometimes required.

We refer to [NOW18; Ned14; Yan+19] for excellent and more comprehensive reviews of distributed primal methods.

As constraints are of major importance in the applications from power systems and control we have in mind, we do not consider primal methods further in this thesis.

<sup>19</sup> The convergence of first-generation algorithms is sublinear, mainly due to the required diminishing stepsize. If a constant stepsize is used, convergence might be faster, but the "exactness" of the solution is limited in that case [NOR18; YLY16]. This effect is called the *exactness-speed dilemma*.


Table 3.2: Typical properties of first and second generation primal decentralized algorithms.

## **3.3 Internal decomposition methods**

The third main line of research are *internal decomposition* methods. Internal decomposition aims at distributing certain steps from centralized nonlinear programming methods such as interior point methods or SQP. By doing so, the convergence guarantees of the "host algorithm" are inherited. In interior point and SQP methods, the most expensive centralized operation is solving a linear system of equations (the KKT system) in each iteration. Whereas for interior point methods decomposition of the KKT system is often straightforward, for SQP methods the situation is more difficult: one has to consider an extra routine detecting the active set or, alternatively, use an inner algorithm which can handle constraints.

Classically, parallel linear algebra routines such as the Gauss-Seidel, the Jacobi iteration or the conjugate gradient method were employed for distributed computation of the KKT system [BT89, Chap 2], [Saa03, Chap 4]. However, these classical methods usually decompose along rows/columns of the KKT matrix aiming for load balancing and they are thus typically not directly applicable to a multi-agent setting. Domain decomposition methods (or Schur-complement methods) [Saa03, Chap 14] are more suited for the multi-agent setting and we will use similar techniques in Chapter 4.
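The domain-decomposition idea can be sketched with a small numerical example (all data randomly generated here, not taken from any cited solver): the interior variables of two subdomains are eliminated so that only a small interface system remains, and each subdomain's contribution to the interface Schur complement can be formed locally.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 2                              # interior vars per subdomain, interface vars

def spd(k):                              # random symmetric positive definite block
    M = rng.standard_normal((k, k))
    return M @ M.T + k * np.eye(k)

B1, B2, F = spd(n), spd(n), spd(m)
E1, E2 = rng.standard_normal((n, m)), rng.standard_normal((n, m))
b1, b2, g = rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(m)

# Interface Schur complement: each subdomain contributes a local term
S = F - E1.T @ np.linalg.solve(B1, E1) - E2.T @ np.linalg.solve(B2, E2)
s = g - E1.T @ np.linalg.solve(B1, b1) - E2.T @ np.linalg.solve(B2, b2)
xc = np.linalg.solve(S, s)               # small coupled interface solve
x1 = np.linalg.solve(B1, b1 - E1 @ xc)   # local back substitution
x2 = np.linalg.solve(B2, b2 - E2 @ xc)

# Reference: monolithic solve of the full block system
K = np.block([[B1, np.zeros((n, n)), E1],
              [np.zeros((n, n)), B2, E2],
              [E1.T, E2.T, F]])
ref = np.linalg.solve(K, np.concatenate([b1, b2, g]))
```

The same pattern of local elimination plus a small coupled solve underlies the reduction of the ALADIN coordination step in Chapter 4.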

### **Distributed interior-point and SQP methods**

Early works on distributed interior point methods for convex problems in form of (2.12) start with [HOS01], where the KKT system is solved by specialized solvers for block-tridiagonal systems. A Schur-complement-based approach is presented in [GG07] and [GG09], where [GG09] efficiently exploits nested tree structures. The work [NS09] combines dual decomposition with ideas from interior point methods, where the KKT system is condensed based on a Schur-complement technique and subsequently solved by a centralized linear solver. A generalization thereof is presented in [Din+13] coming with a novel convergence analysis allowing for inexact solutions of the subproblems. The follow-up work [LT20] generalized this work for broader problem classes based on tools from [ST19b] and a polynomial-time complexity is established. A general structure-exploiting interior point method for problems in form of (2.12) is presented in [PHA14], where the KKT system is solved via ADMM.

Examples of distributed implementations of interior point methods for time-dependent problems (e.g. from optimal control) can be found in [ZLB08; Kan+14; Wor+14; HSS17]. In these works, decomposition is typically done *over time*, i.e. the problems are often in form of (2.12) but where R is a *time index* rather than a set of subsystems. An augmented-Lagrangian-based parallel interior point method for this setting is given in [CSL16]. One recent example of using Schur-complement techniques is implemented in the PARDISO solver [Sch+01] in combination with the very successful interior point implementation IPOPT [WB06; Ver+17; KFS18; KKS20a; KKS20b]. The main focus of these implementations is the application to time-dependent problems from power systems such as multi-stage optimal power flow or security-constrained optimal power flow—however, the methods used there are in principle applicable to many problems with a similar sparsity structure. A similar approach with application to spatial decomposition in power systems is presented in [Lu+18].


Table 3.3: Typical properties of internal decomposition methods.

For a multi-agent setting, an approach based on primal barrier functions is presented in [BPJ17] and a primal-dual based approach is presented in [BJ17]. These works guarantee convergence with decentralized barrier parameter updates.

An early partially parallel SQP version is presented in [Lei+03]. [Sch15] presents a commercial distributed SQP method.

#### **Pros and cons of internal decomposition methods**

Internal decomposition methods are in general well-suited for constrained optimization. They typically consider some form of constraint linearization in their step computation; thus the curvature information of the constraints is effectively exploited, often leading to very fast convergence also in the constrained case. However, as all of these methods are second-order (i.e. they use Hessian and Jacobian information), step computation is in general much more expensive. Using Schur-complement and reduced-space techniques can substantially reduce the dimension of the KKT system. However, solving the KKT system is in general a highly centralized operation. Moreover, these methods are typically *not* tailored to a multi-agent setting; they are rather designed for parallel computing.

<sup>5</sup> The convergence rate depends on the precision of solving the KKT system and on the "exactness" of the Hessian approximation. Exact Hessians with exact solutions to the KKT system lead to quadratic convergence; approximations like BFGS lead to superlinear convergence, cf. Example 2.


Table 3.4: Overview on main lines of research for distributed optimization.

For our applications in power systems and control, techniques from internal decomposition seem to be excellent fits as we often have non-convex objectives and constraints. However, one challenge is to reduce the complexity for solving the KKT system.

### **Mixed approaches**

Finally, there are a few algorithms which are hard to categorize along the above lines of research. One of them is an alternating trust region method [HJ17] which converges linearly to local minimizers. The second one is the Augmented Lagrangian Alternating Direction Inexact Newton (ALADIN) method [HFD16]. ALADIN can be viewed as a mix between primal-dual and SQP methods combining distributed optimization with the fast (superlinear or even quadratic) convergence properties of SQP. In contrast to many other methods from before, these two methods are able to handle non-convex constraints. A decomposition scheme for time-wise decomposition of optimal control problems based on ALADIN is presented in [Kou+16].

# **3.4 Comparison of algorithms**

Let us assess the aforementioned algorithm classes in view of desired properties for distributed optimization. Classical primal-dual algorithms such as ADMM and dual decomposition are well investigated and work robustly for a wide range of practical (convex) problems. Moreover, convergence guarantees have recently been extended to special classes of non-convex problems. For separable problem structures, they are able to operate in a decentralized fashion, although centralized parameter tuning might be required. Their amount of communication is usually very small, and recently also advanced features such as operating on directed graphs or asynchronous operation have been investigated. However, in case of constrained problems, certain undesired alternating projection effects might occur (cf. Section 2.3).

Primal algorithms do very well in decentralized operation as they do not assume a special structure of the problem, and the communication graph can be chosen independently from the problem formulation. Moreover, similar to primal-dual methods, the communication overhead per iteration is quite small, and many of these algorithms are able to operate on directed graphs with changing topology and asynchronously. On the other hand, at least first-generation primal algorithms are very slow. The convergence rate of second-generation algorithms can be similar to primal-dual methods. A drawback of primal algorithms is that considering constraints (even convex ones) is often not possible.

Internal decomposition methods are often not decentralized as they aim for parallel computing and computational speedup rather than for a multi-agent setting. As they often solve a reduced KKT system, the communication overhead is quite large, and the coordination problem might be expensive to solve due to additional centralized operations such as the update of the barrier parameter in interior point methods. However, very fast convergence is often guaranteed also for problems with non-convex objectives *and* constraints.

Lastly, ALADIN is a mixture of primal-dual methods and internal decomposition methods, inheriting the fast convergence properties of internal decomposition methods and the convergence guarantees for constrained non-convex problems. Moreover—in contrast to internal decomposition methods—ALADIN offers a certain degree of distribution by detecting the active constraints locally and


Table 3.5: Existing works in view of desired properties for distributed algorithms.

reducing the coordination step to a linear system of equations. However, compared to primal-dual and primal methods, its coordination and communication effort is quite large. The properties of primal-dual, primal and internal decomposition methods in view of the requirements for distributed optimization are summarized in Table 3.5. Therein (+) and (++) indicate that the considered algorithm fulfills the listed property and (-) and (- -) indicate that it does not.

### **Prospect for application in power systems and control**

The present thesis has mainly applications from power systems and optimal control in mind. Characteristic in these two classes of problems are *nonconvex constraints* which essentially leave two classes of distributed algorithms for further investigation: primal-dual methods and internal decomposition methods. As ALADIN is a mix between these two it seems to be an excellent fit. However, as indicated above, ALADIN has a quite heavy coordination step and requires a quite large amount of central communication. Motivated by this fact, a main focus of the present thesis will be to develop a *decentralized* ALADIN-based algorithm preserving its favorable convergence guarantees and *reducing its communication requirements*. In the next chapter, we present a new algorithmic framework fulfilling these requirements which we call *bi-level ALADIN*.

<sup>6</sup> With advanced features we mean features such as asynchronous operation, the ability to handle communication delays, operation over directed graphs or a changing graph topology.

# **4 Bi-level Distributed ALADIN**

In the previous sections we highlighted that the core advantages of ALADIN are its fast local convergence guarantees and an efficient handling of non-convex constraints. Disadvantages are a large amount of communication and a missing decentralization. This chapter proposes a framework in which both are addressed—operation based on neighbor-to-neighbor communication (decentralized optimization) and a reduced communication overhead—while preserving ALADIN's fast local convergence guarantees.

By having a closer look at Algorithm 5, one can observe that only the coordination step 2) requires central coordination and global communication.<sup>1</sup> Thus, the idea of this chapter is to decentralize this step by means of decentralized inner algorithms.

Large parts of this chapter are based on [Eng+20; Eng+19b]. The sparsity encoding techniques and the decentralized conjugate gradient algorithm are improved versions of those in [Eng+20] and have not been published so far. The convergence analysis in Section 4.5 is a unified version of [Eng+20], which does not require any results from [HFD16] and is purely based on the nonlinear programming theory from Chapter 2. Similar approaches can also be found in the literature on numerical methods for optimal control [Kou+16; FSD15].

# **4.1 Reducing communication by condensing**

In a first step, we apply so-called reduced-space or condensing techniques to the coordination QP of ALADIN. By doing so, we reduce the dimension of the

<sup>1</sup> A globalization also requires either central coordination or costly additional communication. However, as stated before, here we focus on a *local* algorithm. Designing distributed globalization routines is still an open problem.

QP and thus ALADIN's communication overhead. The basic ideas for doing so are from the literature for nonlinear programming [NW06].

Recall the coordination QP of ALADIN including slack variables (2.25)

$$\min\_{\Delta\mathbf{x},d} \sum\_{i \in \mathcal{R}} \frac{1}{2} \Delta\mathbf{x}\_i^\top B\_i^k \Delta\mathbf{x}\_i + \nabla f\_i^\top (\mathbf{x}\_i^k) \Delta\mathbf{x}\_i + \boldsymbol{\lambda}^{k\top} d + \frac{\rho}{2} \|d\|\_2^2 \tag{4.1a}$$

$$\text{subject to} \qquad C\_i^k \Delta \mathbf{x}\_i = 0 \qquad \forall i \in \mathcal{R}, \tag{4.1b}$$

$$\sum\_{i \in \mathcal{R}} A\_i (\mathbf{x}\_i^k + \Delta \mathbf{x}\_i) - b = d \qquad \mid \boldsymbol{\lambda}^{\mathrm{QP}}.\tag{4.1c}$$

The first technique we apply is the so-called nullspace method [NW06, Ch. 12], eliminating the constraint linearizations $C_i^k \in \mathbb{R}^{n_{h_i} \times n_{x_i}}$ from (4.1b). To this end, we need a basis of the nullspace of $C_i^k$. Let us construct a matrix $Z_i^k \in \mathbb{R}^{n_{x_i} \times (n_{x_i} - n_{h_i})}$ with $C_i^k Z_i^k = 0$ for all $i \in \mathcal{R}$, whose columns form a basis of this nullspace.<sup>2</sup> This allows parameterizing $\Delta x_i = Z_i^k \Delta v_i$ and thus

$$\min\_{\Delta\boldsymbol{v},d} \sum\_{i \in \mathcal{R}} \frac{1}{2} \Delta\boldsymbol{v}\_i^\top \bar{\boldsymbol{B}}\_i^k \Delta\boldsymbol{v}\_i + \bar{\boldsymbol{g}}\_i^{k\top} \Delta\boldsymbol{v}\_i + \boldsymbol{\lambda}^{k\top} \boldsymbol{d} + \frac{\rho}{2} \|\boldsymbol{d}\|\_2^2 \tag{4.2a}$$

$$\text{subject to} \sum\_{i \in \mathcal{R}} A\_i \mathbf{x}\_i^k + \bar{A}\_i^k \Delta \mathbf{v}\_i - b = d \qquad | \; \lambda^{\text{QP}}.\tag{4.2b}$$

In case of many active constraints, the reduced Hessians $\bar{B}_i^k := Z_i^{k\top} B_i^k Z_i^k$, the reduced gradients $\bar{g}_i^k := Z_i^{k\top} \nabla f_i(x_i^k)$ and the reduced coupling matrices $\bar{A}_i^k = A_i Z_i^k$ are much smaller in dimension compared with (4.1). We also reduce the dimension of the decision vector from $n_{x_i}$ (for $\Delta x_i$) to $n_{x_i} - n_{h_i}$ (for $\Delta v_i$). Hence, we reduce communication when solving (4.2) instead of (4.1) in step 2) of ALADIN. A detailed communication analysis will be given in Section 4.6.
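The nullspace reduction can be sketched numerically. In the following toy example (random data; the SVD-based nullspace construction is our own illustrative choice, and full row rank of the constraint Jacobian is assumed), an equality-constrained QP is solved in the reduced space and checked against its full KKT system.

```python
import numpy as np

rng = np.random.default_rng(1)
nx, nh = 5, 2                           # variables, active constraints
M = rng.standard_normal((nx, nx))
B = M @ M.T + nx * np.eye(nx)           # SPD Hessian approximation
g = rng.standard_normal(nx)             # gradient
C = rng.standard_normal((nh, nx))       # constraint linearization (full row rank)

# Nullspace basis Z with C @ Z = 0 via SVD (last nx - nh right singular vectors)
Z = np.linalg.svd(C)[2][nh:].T          # shape (nx, nx - nh)

B_red = Z.T @ B @ Z                     # reduced Hessian
g_red = Z.T @ g                         # reduced gradient
dv = np.linalg.solve(B_red, -g_red)     # unconstrained step in reduced space
dx = Z @ dv                             # lift back; satisfies C @ dx = 0

# Cross-check against the full KKT system of min 0.5 dx'B dx + g'dx, C dx = 0
K = np.block([[B, C.T], [C, np.zeros((nh, nh))]])
ref = np.linalg.solve(K, np.concatenate([-g, np.zeros(nh)]))
```

The reduced system has dimension nx − nh, which is the communication saving exploited above.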

<sup>2</sup> If rows of $C_i^k$ are linearly dependent, the number of columns of $Z_i^k$ is $n_{x_i} - n_{h_i} + r_i$, where $r_i$ is the number of linearly dependent rows of $C_i^k$.

#### **The Schur complement**

Employing the Schur complement allows to further reduce the dimensionality of the QP (4.2). For doing so, we need to form the inverse of $\bar{B}_i^k$. Hence, we assume the following.

*Assumption 1* The reduced Hessian approximations $\bar{B}_i^k$ are symmetric and positive definite for all iterates $k \in \mathbb{N}$ and all $i \in \mathcal{R}$.

This assumption holds, for example, if the Hessian of the local Lagrangian is positive definite on the nullspace of the constraint linearization $C_i^k$ (i.e. on the subspace spanned by the columns of $Z_i^k$) for all $k \in \mathbb{N}$. Alternatively, it can be enforced by choosing an appropriate Hessian approximation $\bar{B}_i^k$. We will further comment on how to do so in Chapter 6. Moreover, we assume the following.

*Assumption 2* The matrix $A = [A_1, \ldots, A_R]$ has full row rank.

As $\bar{B}^k \succ 0$, the KKT conditions (2.2) for problem (4.2) are necessary and sufficient for a global minimizer of (4.2). The Lagrangian of (4.2) is

$$\begin{split} \mathcal{L}^{\mathrm{QP}}(\Delta \boldsymbol{\nu}, \boldsymbol{d}, \boldsymbol{\lambda}^{\mathrm{QP}}) := \sum\_{i \in \mathcal{R}} \frac{1}{2} \Delta \boldsymbol{\nu}\_{i}^{\top} \bar{B}\_{i}^{k} \Delta \boldsymbol{\nu}\_{i} + \bar{\mathsf{g}}\_{i}^{k \top} \Delta \boldsymbol{\nu}\_{i} + \boldsymbol{\lambda}^{k \top} \boldsymbol{d} + \frac{\rho}{2} \lVert \boldsymbol{d} \rVert\_{2}^{2} \\ &+ \boldsymbol{\lambda}^{\mathrm{QP} \top} \left( \sum\_{i \in \mathcal{R}} A\_{i} \boldsymbol{x}\_{i}^{k} + \bar{A}\_{i}^{k} \Delta \boldsymbol{\nu}\_{i} - \boldsymbol{b} - \boldsymbol{d} \right). \end{split}$$

Thus, the KKT conditions $\nabla \mathcal{L}^{\mathrm{QP}}(\Delta \nu, d, \lambda^{\mathrm{QP}}) = 0$ read

$$
\bar{B}^k \Delta \nu + \bar{\mathbf{g}}^k + \bar{A}^{k \top} \lambda^{\text{QP}} = 0,\tag{4.3a}
$$

$$
\lambda^k + \rho d - \lambda^{\text{QP}} = 0,\tag{4.3b}
$$

$$A \mathbf{x}^k + \bar{A}^k \Delta \nu - b - d = 0,\tag{4.3c}$$

where $\bar{B}^k = \operatorname{diag}_{i \in \mathcal{R}}(\bar{B}_i^k)$, $\bar{g}^k = (\bar{g}_1^{k\top}, \ldots, \bar{g}_R^{k\top})^\top$, $\bar{A}^k = (\bar{A}_1^k, \ldots, \bar{A}_R^k)$ and $\Delta v = (\Delta v_1^\top, \ldots, \Delta v_R^\top)^\top$. Here, $\operatorname{diag}_{i \in \mathcal{R}}(\cdot)$ denotes diagonal concatenation of matrices indexed by the set $\mathcal{R}$. By eliminating (4.3b) with $d = \frac{1}{\rho}(\lambda^{\mathrm{QP}} - \lambda^k)$, we obtain

$$\bar{B}^k \Delta \nu + \bar{\mathbf{g}}^k + \bar{A}^{k\top} \lambda^{\mathrm{QP}} = 0, \tag{4.4a}$$

$$A x^k + \bar{A}^k \Delta \nu - b - \frac{1}{\rho} (\lambda^{\text{QP}} - \lambda^k) = 0. \tag{4.4b}$$

In order to further reduce dimensionality of (4.4), we rearrange (4.4a) as

$$
\Delta \boldsymbol{\nu} = -\bar{B}^{k^{-1}} (\bar{\mathbf{g}}^k + \bar{A}^{k\top} \boldsymbol{\lambda}^{\text{QP}}), \tag{4.5}
$$

where we used Assumption 1. Inserting (4.5) into (4.4b) yields

$$\left(\underbrace{\bar{A}^k \, \bar{B}^{k^{-1}} \, \bar{A}^{k\top}}\_{:= S^k} + \frac{1}{\rho}I\right) \lambda^{\mathrm{QP}} = \underbrace{A \, x^k - \bar{A}^k \, \bar{B}^{k^{-1}} \bar{g}^k}\_{:= s^k} - b + \frac{1}{\rho} \lambda^k. \tag{4.6}$$

#### **Exploiting block structure**

Because of the block structure of $\bar{A}^k$ and $\bar{B}^k$ in (4.6), one can write

$$S^k = \bar{A}^k \, \bar{B}^{k^{-1}} \, \bar{A}^{k \top} = \sum\_{i \in \mathcal{R}} \bar{A}\_i^k \, \bar{B}\_i^{k^{-1}} \, \bar{A}\_i^{k \top} = \sum\_{i \in \mathcal{R}} S\_i^k \tag{4.7}$$

with $S_i^k := \bar{A}_i^k \bar{B}_i^{k^{-1}} \bar{A}_i^{k\top}$. Moreover,

$$\mathbf{s}^{k} = A\mathbf{x}^{k} - \bar{A}^{k}\bar{B}^{k^{-1}}\bar{\mathbf{g}}^{k} = \sum\_{i \in \mathcal{R}} A\_{i}\mathbf{x}\_{i}^{k} - \bar{A}\_{i}^{k}\bar{B}\_{i}^{k^{-1}}\bar{\mathbf{g}}\_{i}^{k} = \sum\_{i \in \mathcal{R}} s\_{i}^{k} \tag{4.8}$$

with $s_i^k := A_i x_i^k - \bar{A}_i^k \bar{B}_i^{k^{-1}} \bar{g}_i^k$. Observe that $S_i^k$ and $s_i^k$ can be computed entirely locally by the subsystems $i \in \mathcal{R}$. Hence, it suffices that the subsystems communicate these two quantities to a coordinator, which then solves (4.6).

Note that (4.6) is a linear system of equations of dimension $n_c$, the number of consensus constraints, which is typically much smaller than the total number of decision variables. If one solves (4.6) for $\lambda^{\mathrm{QP}}$, all $\Delta v_i$ and $\Delta x_i$ can be computed by back substitution in (4.5) and by $\Delta x_i = Z_i^k \Delta v_i$.
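The reduction and back substitution can be sketched as follows (the two-subsystem setup, dimensions and random data are our own illustrative choices): each subsystem forms its Schur contributions locally, the coordinator solves the small system (4.6), and the steps are recovered via (4.5).

```python
import numpy as np

rng = np.random.default_rng(2)
nc, rho = 3, 10.0                        # consensus constraints, slack penalty
nv = [4, 3]                              # reduced dimensions n_xi - n_hi

Bbar, gbar, Abar, Ax = [], [], [], []
for n in nv:
    M = rng.standard_normal((n, n))
    Bbar.append(M @ M.T + n * np.eye(n)) # reduced Hessians (SPD)
    gbar.append(rng.standard_normal(n))  # reduced gradients
    Abar.append(rng.standard_normal((nc, n)))  # reduced couplings A_i Z_i^k
    Ax.append(rng.standard_normal(nc))   # contributions A_i x_i^k
b = rng.standard_normal(nc)
lam_k = rng.standard_normal(nc)

# Each subsystem forms S_i^k and s_i^k locally, cf. (4.7)/(4.8)
S = sum(A @ np.linalg.solve(B, A.T) for A, B in zip(Abar, Bbar))
s = sum(ax - A @ np.linalg.solve(B, g)
        for ax, A, B, g in zip(Ax, Abar, Bbar, gbar))

# The coordinator solves the small nc-dimensional system (4.6) ...
lam_qp = np.linalg.solve(S + np.eye(nc) / rho, s - b + lam_k / rho)
# ... and every subsystem recovers its step by back substitution, cf. (4.5)
dv = [np.linalg.solve(B, -(g + A.T @ lam_qp))
      for B, g, A in zip(Bbar, gbar, Abar)]
```

Only the nc-dimensional quantities travel to the coordinator, which is exactly the communication saving quantified in Section 4.6.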

Figure 4.1: Sparsity patterns of $\{S_i^k\}_{i \in \{1,2,3\}}$ for the 5-bus system (Figure 5.2b).

In summary, this subsection showed that instead of solving (4.1) one can solve (4.6) in the coordination step 2) of ALADIN (Algorithm 5). This reduces communication and coordination drastically in case of problems with many active constraints: it suffices to communicate the sensitivities of reduced dimension (the Schur complements $S_i^k$) and the vectors $s_i^k$ instead of the full sensitivities. However, note that although this version of ALADIN requires less communication, it still requires central coordination. In the next subsection we derive an algorithm solving (4.6) in a *decentralized* fashion, rendering ALADIN a decentralized optimization algorithm.

## **4.2 Decentralization of coordination**

Next we develop two algorithms solving the reduced coordination step (4.6) in a *decentralized* fashion, i.e. purely based on *neighbor-to-neighbor* communication. For doing so, we first analyze the sparsity pattern of $S_i^k$.

### **The sparsity of** $S_i^k$

Although the matrix $S^k$ is generally dense, the matrices $S_i^k$ are not necessarily so. Figure 4.1 shows the sparsity patterns from an optimization problem from power systems which we will use in Chapter 5. Here, certain rows/columns are structurally zero. This is the case since the rows of $A$ often express coupling between variables in *two* different subsystems; the zero rows/columns in $S_i^k$ come from the *zero rows* in $A_j$ of all *remaining subsystems* $j \neq i$. In the following we call the two subsystems $i, j \in \mathcal{R}$ which share a common non-zero row in their $A_i$ and $A_j$ *assigned* to this row.

*Definition 6* A subsystem $i \in \mathcal{R}$ is called assigned to consensus constraint $c \in \mathcal{C} = \{1, \ldots, n_c\}$ if the $c$th row of $A_i$ is non-zero.

Moreover, we collect all subsystems assigned to consensus constraint $c \in \mathcal{C}$ in

$$\mathcal{R}(c) \coloneqq \{ i \in \mathcal{R} \mid i \text{ assigned to } c \in \mathcal{C} \},$$

and we collect all consensus constraints assigned to a subsystem $i \in \mathcal{R}$ in

$$\mathcal{C}(i) := \{ c \in \mathcal{C} \mid i \in \mathcal{R}(c) \}.$$

With that, we are ready to make a statement on the sparsity of $S_i^k$ and $s_i^k$.

*Lemma 1* The rows and columns $c$ of $S_i^k$ and the entries $c$ of $s_i^k$ with $c \notin \mathcal{C}(i)$ are zero.

*Proof.* By Definition 6, all rows $c$ of $\bar{A}_i^k = A_i Z_i^k$ with $c \notin \mathcal{C}(i)$ are zero. It follows immediately that the rows and columns of $S_i^k = \bar{A}_i^k \bar{B}_i^{k^{-1}} \bar{A}_i^{k\top}$ and the entries of $s_i^k = A_i x_i^k - \bar{A}_i^k \bar{B}_i^{k^{-1}} \bar{g}_i^k$ with $c \notin \mathcal{C}(i)$ are zero. □

## **4.2.1 Exploiting sparsity for decentralization**

The idea now is to exploit this sparsity for decentralization. Our goal is to formulate (4.6) as

$$\left(\sum\_{i\in\mathcal{R}}\tilde{\mathbf{S}}\_i^k\right)\lambda^{\mathrm{QP}}=\sum\_{i\in\mathcal{R}}\tilde{s}\_i^k,\tag{4.9}$$

where $\tilde{S}_i^k$ and $\tilde{s}_i^k$ are formed entirely locally and preserve the sparsity of $S_i^k$ and $s_i^k$. Two quantities hindering us are $\frac{1}{\rho} I$ and $\frac{1}{\rho} \lambda^k - b$ in (4.6), which do not "belong" to any of the subsystems.

The main idea in the following is to "distribute" $\frac{1}{\rho} I$ and $\frac{1}{\rho} \lambda^k - b$ among the subproblems which are assigned to the corresponding consensus constraints. To this end, we define

$$\tilde{S}\_i^k := S\_i^k + \sum\_{c=1}^{n\_c} \frac{\iota\_{\mathcal{C}(i)}(c)}{|\mathcal{R}(c)|\,\rho} I\_c,\quad \text{and}\quad \tilde{s}\_i^k := s\_i^k + \sum\_{c=1}^{n\_c} \frac{\iota\_{\mathcal{C}(i)}(c)}{|\mathcal{R}(c)|\,\rho} (\lambda\_c^k - b\_c) e\_c. \tag{4.10}$$

where $I_c \in \mathbb{R}^{n_c \times n_c}$ is a zero matrix except for $[I_c]_{cc} = 1$, $\iota_{\mathcal{C}(i)}(\cdot)$ is the indicator function of the set $\mathcal{C}(i)$, and $e_c \in \mathbb{R}^{n_c}$ is the $c$th unit vector. This way, the sparsity patterns of $S_i^k$ and $s_i^k$ are preserved and (4.6) is reformulated in the form of (4.9).
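The splitting in (4.10) can be checked numerically: summing the local regularization contributions over all assigned subsystems recovers the regularization term exactly. The assignment sets below are hypothetical.

```python
import numpy as np

rho, nc = 10.0, 4
# Hypothetical assignment: which constraints belong to which subsystem, C(i)
C_of = {1: [0, 1], 2: [1, 2, 3], 3: [0, 2]}
# Derived R(c): subsystems assigned to each consensus constraint c
R_of = {c: [i for i, cs in C_of.items() if c in cs] for c in range(nc)}

def local_reg(i):
    """Local regularization contribution of subsystem i per (4.10)."""
    D = np.zeros((nc, nc))
    for c in C_of[i]:
        D[c, c] = 1.0 / (len(R_of[c]) * rho)
    return D

# Summing over all subsystems recovers (1/rho)*I on the shared constraints
total = sum(local_reg(i) for i in C_of)
```

Each subsystem only touches the diagonal entries indexed by its own C(i), so the sparsity of its contribution is preserved.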

This sparsity pattern can in turn be exploited by ADMM leading to a *decentralized ADMM*. To this end, we express (4.9) as a strongly convex QP.

*Lemma 2* The QP

$$\min\_{\bar{\lambda}} \quad \frac{1}{2} \bar{\lambda}^{\top} \sum\_{i \in \mathcal{R}} \tilde{S}\_{i}^k \bar{\lambda} - \sum\_{i \in \mathcal{R}} \tilde{s}\_{i}^{k\top} \bar{\lambda} \tag{4.11}$$

is strongly convex; its minimizer is unique and solves (4.9).

*Proof.* First, we prove strong convexity of (4.11), i.e. $\tilde{S}^k := \sum_{i \in \mathcal{R}} \tilde{S}_i^k \succ 0$. By Assumption 1 we have $\bar{B}_i^k \succ 0$ and hence $\bar{B}^k \succ 0$. Thus, $\bar{B}^{k^{-1}} \succ 0$. Furthermore, since $Z^k$ has full column rank, $\bar{A}^k = A Z^k$ has full row rank by Assumption 2. Hence, $v := \bar{A}^{k\top} u \neq 0$ for all $u \neq 0$ and thus $S^k = \bar{A}^k \bar{B}^{k^{-1}} \bar{A}^{k\top} \succ 0$ by $v^\top \bar{B}^{k^{-1}} v > 0$. By the definition of $\tilde{S}_i^k$ in (4.10), we have $\tilde{S}^k = S^k + \frac{1}{\rho} I \succ 0$ by $S^k \succ 0$.

Let us show that $\tilde{S}^k = \tilde{S}^{k\top}$. We have $S^{k\top} = (\bar{A}^k \bar{B}^{k^{-1}} \bar{A}^{k\top})^\top = \bar{A}^k (\bar{B}^{k^{-1}})^\top \bar{A}^{k\top} = \bar{A}^k (\bar{B}^{k\top})^{-1} \bar{A}^{k\top}$. By Assumption 1, we have $\bar{B}^{k\top} = \bar{B}^k$ and hence the assertion follows.

It remains to show that the minimizer of (4.11) is unique and coincides with the solution to (4.9). Uniqueness follows from strong convexity. Again from strong convexity, the first-order necessary conditions are also sufficient for a minimum. They read $\frac{1}{2} \big( \sum_{i \in \mathcal{R}} \tilde{S}_i^k + \sum_{i \in \mathcal{R}} \tilde{S}_i^{k\top} \big) \bar{\lambda} - \sum_{i \in \mathcal{R}} \tilde{s}_i^k = 0$ and thus coincide with (4.9) by symmetry of $\tilde{S}^k$. □

Figure 4.2: Sparsity patterns of $\{I_{\mathcal{C}(i)}\}_{i \in \{1,2,3\}}$.

## **4.2.2 Decentralized ADMM**

To derive a decentralized variant of ADMM (d-ADMM), we restate (4.11) in so-called *consensus form* (A.9). We introduce matrices $I_{\mathcal{C}(i)} \in \mathbb{R}^{|\mathcal{C}(i)| \times n_c}$ mapping from global Lagrange multipliers $\bar{\lambda} \in \mathbb{R}^{n_c}$ to local ones $\lambda_i \in \mathbb{R}^{|\mathcal{C}(i)|}$ by $\lambda_i := I_{\mathcal{C}(i)} \bar{\lambda}$. These matrices are defined as the vertical concatenation of the unit vectors $e_c^\top \in \mathbb{R}^{1 \times n_c}$ indexed by the set $\mathcal{C}(i)$. Alternatively, these matrices may be viewed as identity matrices where certain rows have been "deleted", eliminating the zero rows/columns in $\tilde{S}_i^k$ and $\tilde{s}_i^k$. Figure 4.2 shows exemplary sparsity patterns of $\{I_{\mathcal{C}(i)}\}_{i \in \mathcal{R}}$ for an example from power systems from Section B.4. The matrices $I_{\mathcal{C}(i)}$ have an interesting property, namely that $\tilde{S}_i^k$ and $\tilde{s}_i^k$ are invariant under multiplication with $I_{\mathcal{C}(i)}^\top I_{\mathcal{C}(i)}$.

*Lemma 3 (Invariance of* $\tilde{S}_i^k$ *and* $\tilde{s}_i^k$ *under multiplication with* $I_{\mathcal{C}(i)}^\top I_{\mathcal{C}(i)}$*)* It holds that $I_{\mathcal{C}(i)}^\top I_{\mathcal{C}(i)} \tilde{S}_i^k = \tilde{S}_i^k$ and $I_{\mathcal{C}(i)}^\top I_{\mathcal{C}(i)} \tilde{s}_i^k = \tilde{s}_i^k$.

*Proof.* By definition, $I_{\mathcal{C}(i)}$ is composed of unit vectors $e_c^\top$, $c \in \mathcal{C}(i)$, for which $e_c^\top e_d = 1$ if and only if $c = d$ and $e_c^\top e_d = 0$ otherwise. Thus, $I_{\mathcal{C}(i)}^\top I_{\mathcal{C}(i)}$ is an identity matrix with zero rows/columns for all rows/columns indexed by $c \notin \mathcal{C}(i)$. As these rows are zero in $\tilde{S}_i^k$ and $\tilde{s}_i^k$ anyway by Lemma 1, the assertion follows. □

Algorithm 6. d-ADMM for problem (4.12) **Initialization:** $\bar{\lambda}^0$, $\gamma_i^0$ for all $i \in \mathcal{R}$,

$$\begin{aligned} \text{Repeat for all } i \in \mathcal{R}:\\ \text{1) } \quad \lambda\_i^{n+1} &= \left(\hat{S}\_i^k + \rho I\right)^{-1} \left(\hat{s}\_i^k - \gamma\_i^n + \rho \bar{\lambda}\_i^n\right) &&\text{(local)}\\ \text{2) } \quad \bar{\lambda}\_i^{n+1} &= (\Lambda\_i)^{-1} \sum\_{j \in N(i)} I\_{ij} \lambda\_j^{n+1} &&\text{(neighbor-to-neighbor)}\\ \text{3) } \quad \gamma\_i^{n+1} &= \gamma\_i^n + \rho (\lambda\_i^{n+1} - I\_{\mathcal{C}(i)} \bar{\lambda}^{n+1}) &&\text{(local)} \end{aligned}$$

With that, we are able to reformulate (4.11) in consensus form. Consider $f_i(\bar{\lambda}) := \frac{1}{2} \bar{\lambda}^\top \tilde{S}_i^k \bar{\lambda} - \tilde{s}_i^{k\top} \bar{\lambda}$ as local objectives. By Lemma 3, $f_i(\bar{\lambda}) = f_i(I_{\mathcal{C}(i)}^\top I_{\mathcal{C}(i)} \bar{\lambda}) = f_i(I_{\mathcal{C}(i)}^\top \lambda_i)$. Thus, problem (4.11) can be written as

$$\begin{aligned} \min\_{\lambda\_1, \dots, \lambda\_R, \vec{\lambda}} & \quad \sum\_{i \in \mathcal{R}} f\_i(I\_{C(i)}^\top \lambda\_i) \\ \text{subject to} & \quad \lambda\_i = I\_{C(i)} \vec{\lambda} \mid \text{ } \gamma\_i \text{ for all } i \in \mathcal{R}, \end{aligned} \tag{4.12}$$

with Lagrange multipliers $\gamma_i \in \mathbb{R}^{|\mathcal{C}(i)|}$.

#### **Solution via ADMM**

Let us evaluate the ADMM iterations (Algorithm 3) for problem (4.12). The Augmented Lagrangian for (4.12) is

$$\begin{split} \mathcal{L}^{\rho}(\boldsymbol{\lambda}\_{1}, \ldots, \boldsymbol{\lambda}\_{R}, \overline{\boldsymbol{\lambda}}, \boldsymbol{\gamma}\_{1}, \ldots, \boldsymbol{\gamma}\_{R}) &= \\ \sum\_{i \in \mathcal{R}} f\_{i}(\boldsymbol{I}\_{C(i)}^{\top}\boldsymbol{\lambda}\_{i}) + \boldsymbol{\gamma}\_{i}^{\top}(\boldsymbol{\lambda}\_{i} - \boldsymbol{I}\_{C(i)}\boldsymbol{\lambda}) + \frac{\rho}{2} \|(\boldsymbol{\lambda}\_{i} - \boldsymbol{I}\_{C(i)}\boldsymbol{\lambda})\|\_{2}^{2} . \end{split} \tag{4.13}$$

Necessary conditions for minimizing (4.13) with respect to $\lambda_1, \ldots, \lambda_R$ are

$$\hat{S}\_i^k \lambda\_i^{n+1} - \hat{s}\_i^k + \gamma\_i^n + \rho (\lambda\_i^{n+1} - \bar{\lambda}\_i^n) = 0, \quad \text{for all } i \in \mathcal{R},$$

where $\hat{S}_i^k := I_{\mathcal{C}(i)} \tilde{S}_i^k I_{\mathcal{C}(i)}^\top$ and $\hat{s}_i^k := I_{\mathcal{C}(i)} \tilde{s}_i^k$. Rearranging terms yields step 1) of Algorithm 6.

Step 2) of Algorithm 3 minimizes (4.13) with respect to $\bar{\lambda}$. The first-order optimality conditions are

$$\sum\_{i \in \mathcal{R}} -I\_{C(i)}^\top \gamma\_i^n - \rho I\_{C(i)}^\top (\lambda\_i^{n+1} - I\_{C(i)} \vec{\lambda}^{n+1}) = 0. \tag{4.14}$$

Step 3) of Algorithm 3 is

$$\gamma_i^{n+1} = \gamma_i^n + \rho \left( \lambda_i^{n+1} - I_{C(i)} \bar{\lambda}^{n+1} \right). \tag{4.15}$$

Summing over all $i \in \mathcal{R}$ yields

$$\sum_{i \in \mathcal{R}} \gamma_i^{n+1} = \sum_{i \in \mathcal{R}} \left( \gamma_i^n + \rho \left( \lambda_i^{n+1} - I_{C(i)} \bar{\lambda}^{n+1} \right) \right).$$

Multiplying by $I_{C(i)}^\top$ from the left, together with (4.14), implies that $\sum_{i \in \mathcal{R}} I_{C(i)}^\top \gamma_i^{n+1} = 0$ after the first ADMM iteration. Hence (4.14) becomes

$$\sum_{i \in \mathcal{R}} I_{C(i)}^{\top} \lambda_i^{n+1} - I_{C(i)}^{\top} I_{C(i)} \bar{\lambda}^{n+1} = 0. \tag{4.16}$$

Multiplying by $I_{C(j)}$ from the left yields

$$\sum_{i \in \mathcal{R}} \underbrace{I_{C(j)} I_{C(i)}^{\top}}_{:= I_{ji}} \lambda_i^{n+1} - I_{C(j)} \underbrace{\sum_{i \in \mathcal{R}} I_{C(i)}^{\top} I_{C(i)}}_{:= \Lambda} \bar{\lambda}^{n+1} = 0. \tag{4.17}$$

Since $I_{C(j)} I_{C(j)}^\top = I$ and $AB = BA$ for diagonal matrices $A$, $B$, we have

$$I_{C(j)}\Lambda\bar{\lambda}^{n+1} = I_{C(j)}I_{C(j)}^{\top}I_{C(j)}\Lambda\bar{\lambda}^{n+1} = I_{C(j)}\Lambda I_{C(j)}^{\top}I_{C(j)}\bar{\lambda}^{n+1} = \Lambda_j\bar{\lambda}_j^{n+1},$$

where we define $\Lambda_j := I_{C(j)} \Lambda I_{C(j)}^\top$. Hence, (4.17) can be written as

$$\bar{\lambda}_j^{n+1} = \left(\Lambda_j\right)^{-1} \sum_{i \in \mathcal{R}} I_{ji} \lambda_i^{n+1}.$$

Figure 4.3: Coupling matrices $I_{12}$ and $I_{13}$ corresponding to the matrices $I_{C(i)}$, $i \in \{1,2,3\}$, shown in Figure 4.2.

As $I_{ji} = 0$ for $i \notin N(j)$, we can replace the sum over $i \in \mathcal{R}$ by a sum over $i \in N(j)$, where we define the *neighborhood* $N(i)$ of subsystem $i \in \mathcal{R}$ by

$$N(i) \coloneqq \{ j \in \mathcal{R} \mid \text{there exists a } c \in \mathcal{C} \text{ with } i, j \in \mathcal{R}(c) \}.$$

The identity $I_{ji} = 0$ for $i \notin N(j)$ follows from this definition and the definition of $I_{C(i)}$. This yields step 2) of Algorithm 6.

Note that in Algorithm 6 we have entirely local steps (step 1) and step 3)) and one neighbor-to-neighbor step (step 2)). This means that the algorithm is *decentralized*, i.e. all communication is done locally between neighbors and a central entity is *not* required. Superscripts $(\cdot)^n$ denote d-ADMM iterates; quantities with superscript $(\cdot)^k$ do not change during the d-ADMM iterations—they refer to (outer) ALADIN iterations. Furthermore, note that the initial guesses $\bar{\lambda}_i^0$ have to be *consistent*, i.e. they have to satisfy $\bar{\lambda}_i^0 = I_{C(i)}\bar{\lambda}^0$ for all $i \in \mathcal{R}$.

Observe that the matrices $I_{ij} \in \mathbb{R}^{|C(i)| \times |C(j)|}$ encode "coupling information" between the subsystems. More specifically, they define which entries of the vectors $\lambda_j$ in subsystem $j \in \mathcal{R}$ belong to which entries of the vectors in subsystem $i \in \mathcal{R}$. Furthermore, observe that $I_{ij} = I_{ji}^\top$. The sparsity patterns of $I_{12}$ and $I_{13}$ corresponding to $I_{C(1)}, I_{C(2)}, I_{C(3)}$ from Figure 4.2 are illustrated in Figure 4.3. One can observe that subsystem 1 only requires the first four entries of $\lambda_2$ from subsystem 2.
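To make the role of these selector and coupling matrices concrete, the following sketch builds the $I_{C(i)}$ as 0/1 row-selector matrices for two hypothetical index sets (not the ones from Figure 4.2) and checks the identities used above: $I_{ij} = I_{ji}^\top$, the diagonal structure of $\Lambda = \sum_i I_{C(i)}^\top I_{C(i)}$, and Lemma 4.

```python
import numpy as np

n = 4  # dimension of the global multiplier vector (hypothetical example)
C = {1: [0, 1, 2], 2: [1, 2, 3]}  # index sets C(i), chosen for illustration

# I_C(i): |C(i)| x n selector matrices (rows of the identity)
I_C = {i: np.eye(n)[idx, :] for i, idx in C.items()}

# coupling matrix I_12 = I_C(1) I_C(2)^T and the diagonal matrix Lambda
I_12 = I_C[1] @ I_C[2].T
Lam = sum(P.T @ P for P in I_C.values())            # Lambda = sum_i I_C(i)^T I_C(i)
Lam_i = {i: P @ Lam @ P.T for i, P in I_C.items()}  # Lambda_i = I_C(i) Lambda I_C(i)^T

# I_ij = I_ji^T
assert np.allclose(I_12, (I_C[2] @ I_C[1].T).T)
# Lambda counts how many subsystems share each entry of the global multiplier
assert np.allclose(np.diag(Lam), [1, 2, 2, 1])
# Lemma 4: sum_i I_C(i)^T Lambda_i^{-1} I_C(i) = I
lemma4 = sum(P.T @ np.linalg.inv(Lam_i[i]) @ P for i, P in I_C.items())
assert np.allclose(lemma4, np.eye(n))
print("all identities verified")
```

The weighting by $\Lambda_i^{-1}$ in Lemma 4 compensates exactly for the multiple counting of entries shared between subsystems.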

d-ADMM is guaranteed to converge to the minimizer of (4.12) and the minimizer of (4.12) corresponds to the solution of (4.9) by Lemma 2. Hence, we

Figure 4.4: Communication graph for our power systems example.

have achieved our goal of deriving a decentralized version of the coordination step of ALADIN.
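The iterations of Algorithm 6 can be sketched numerically. The following toy example uses entirely hypothetical data—two subsystems with random $\hat S_i \succ 0$ standing in for the ALADIN Schur-complement data—and, for brevity, performs step 2) as one global average instead of neighbor-to-neighbor exchanges (the two are equivalent, as derived above). The consensus iterate approaches the direct solution $\tilde S^{-1}\tilde s$ of (4.11).

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 4, 1.0
C = {1: [0, 1, 2], 2: [1, 2, 3]}                    # hypothetical index sets
P = {i: np.eye(n)[idx, :] for i, idx in C.items()}  # selectors I_C(i)

# local condensed data: S_hat_i > 0 and s_hat_i (stand-ins, randomly generated)
S_hat, s_hat = {}, {}
for i in C:
    m = len(C[i])
    M = rng.standard_normal((m, m))
    S_hat[i] = M @ M.T + m * np.eye(m)
    s_hat[i] = rng.standard_normal(m)

# centralized reference: solve S_tilde lam = s_tilde
S_t = sum(P[i].T @ S_hat[i] @ P[i] for i in C)
s_t = sum(P[i].T @ s_hat[i] for i in C)
lam_star = np.linalg.solve(S_t, s_t)

Lam = sum(P[i].T @ P[i] for i in C)                 # diagonal matrix Lambda
lam = {i: np.zeros(len(C[i])) for i in C}           # lambda_i
gam = {i: np.zeros(len(C[i])) for i in C}           # gamma_i
lbar = {i: np.zeros(len(C[i])) for i in C}          # lambda-bar_i (consistent)

for _ in range(2000):
    for i in C:                                     # step 1 (local)
        lam[i] = np.linalg.solve(S_hat[i] + rho * np.eye(len(C[i])),
                                 s_hat[i] - gam[i] + rho * lbar[i])
    # step 2 (averaging; decentralized via neighbor-to-neighbor in Algorithm 6)
    avg = np.linalg.solve(Lam, sum(P[i].T @ lam[i] for i in C))
    for i in C:
        lbar[i] = P[i] @ avg
        gam[i] += rho * (lam[i] - lbar[i])          # step 3 (local)

assert np.allclose(avg, lam_star, atol=1e-6)
print("d-ADMM consensus error:", np.linalg.norm(avg - lam_star))
```

Note that starting from $\gamma_i^0 = 0$ keeps $\sum_i I_{C(i)}^\top \gamma_i = 0$ throughout, so the plain average in step 2) is the exact minimizer of the augmented Lagrangian, as in the derivation of (4.16).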

*Remark 10 (Global variable consensus vs. general consensus)* An alternative way of reformulating (4.11) would be via the (simpler) so-called *global variable* consensus [Boy+11]

$$\begin{aligned} \min\_{\lambda\_1, \dots, \lambda\_R, \bar{\lambda}} & \quad \sum\_{i \in \mathcal{R}} f\_i(\lambda\_i) \\ \text{subject to} & \quad \lambda\_i = \bar{\lambda} \quad | \quad \gamma\_i, \text{ for all } i \in \mathcal{R}. \end{aligned} \tag{4.18}$$

This is the formulation we used in [Eng+20]. However, it leads to the situation that $f = \sum_{i \in \mathcal{R}} f_i$ loses its strict convexity with respect to the inflated variable block $\lambda^\top = (\lambda_1^\top, \dots, \lambda_R^\top)$ if not all $f_i$ depend on all components of $\bar{\lambda}$—which is the case here because of the sparsity of $\tilde{S}_i$. This is *not* the case in (4.12), since we eliminate locally redundant variables by means of the matrices $I_{C(i)}$. This is important since convexity vs. strict convexity can lead to sublinear vs. linear convergence of ADMM [MO17; HY12; GB16; HL17; Shi+14].

*Remark 11 (Communication graph)* One can regard the set of subsystems $\mathcal{R}$ as nodes and the set $\mathcal{E} := \{(i,j) \in \mathcal{R} \times \mathcal{R} \mid \text{there exists a } c \in \mathcal{C} \text{ with } i, j \in \mathcal{R}(c)\}$ as edges of a communication graph $G = (\mathcal{R}, \mathcal{E})$. The edges of this graph can equivalently be characterized by all $(i,j)$ for which $I_{ij} \neq 0$. For the power systems example from Section B.4, the graph is fully connected and displayed in Figure 4.4.

## **4.2.3 Decentralized conjugate gradient**

Using d-ADMM as an inner coordination algorithm has the advantage that only little information has to be exchanged between neighboring subsystems, namely the quantities forming the averaged values $\bar{\lambda}_i$. However, ADMM requires tuning, which might be difficult (cf. Section 5.5), and its convergence rate is at most linear.

To address these drawbacks, the idea of this section is to develop a decentralized variant of the *conjugate gradient* (CG) algorithm (d-CG) with fast convergence. A main advantage of CG is its guarantee to converge to the *exact* solution of a linear system of equations in *at most as many steps as the dimension of the system* [NW06, Ch. 5.1]. Although this property is weakened in practice due to numerical round-off errors, the practically observed convergence rate is superlinear, which is very fast compared with other methods such as ADMM [BK01].

The idea of CG is to iteratively minimize (4.11) by generating directions $\{p^n\}$ which are $\tilde{S}$*-conjugate* to each other, i.e. $p^{i\top} \tilde{S} p^j = 0$ for all $i \neq j$. $\tilde{S}$-conjugacy can be seen as a generalization of orthogonality. Performing one-dimensional minimizations along these conjugate directions $\{p^n\}$ yields the conjugate gradient algorithm [NW06, Ch. 5.1]. The iterations are given by

$$\alpha^n = \frac{r^{n\top}r^n}{p^{n\top}\tilde{S}p^n},\tag{4.19a}$$

$$
\bar{\lambda}^{n+1} = \bar{\lambda}^n + \alpha^n p^n,\tag{4.19b}
$$

$$r^{n+1} = r^n - \alpha^n \tilde{S} p^n,\tag{4.19c}$$

$$\beta^n = \frac{r^{n+1\top}r^{n+1}}{r^{n\top}r^n},\tag{4.19d}$$

$$p^{n+1} = r^{n+1} + \beta^n p^n,\tag{4.19e}$$

where the initialization has to satisfy $p^0 = r^0 = \tilde{s} - \tilde{S}\bar{\lambda}^0$. Now, let us again use the sparsity results from Lemma 1, Lemma 3 and the following additional result.
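As a sanity check, the recursion (4.19a)–(4.19e) can be implemented in a few lines. The sketch below uses a random SPD matrix as a stand-in for $\tilde S$, runs exactly $n$ iterations, and recovers the solution of the linear system, illustrating the finite-step property.

```python
import numpy as np

def cg(S, s, lam0, iters):
    """Conjugate gradient for S lam = s, following (4.19a)-(4.19e)."""
    lam = lam0.copy()
    r = s - S @ lam                       # r^0 = s - S lam^0
    p = r.copy()                          # p^0 = r^0
    for _ in range(iters):
        alpha = (r @ r) / (p @ S @ p)     # (4.19a)
        lam = lam + alpha * p             # (4.19b)
        r_new = r - alpha * (S @ p)       # (4.19c)
        beta = (r_new @ r_new) / (r @ r)  # (4.19d)
        p = r_new + beta * p              # (4.19e)
        r = r_new
    return lam

rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
S = M @ M.T + n * np.eye(n)               # SPD stand-in for S-tilde
s = rng.standard_normal(n)

lam = cg(S, s, np.zeros(n), iters=n)      # at most n steps in exact arithmetic
assert np.allclose(lam, np.linalg.solve(S, s), atol=1e-8)
print("CG error after n steps:", np.linalg.norm(lam - np.linalg.solve(S, s)))
```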

*Lemma 4* It holds that $\sum_{i \in \mathcal{R}} I_{C(i)}^\top \Lambda_i^{-1} I_{C(i)} = I$.

*Proof.* Since $\Lambda$ is diagonal and $I_{C(i)}$ is a concatenation of unit vectors, we have $\Lambda_i^{-1} = (I_{C(i)} \Lambda I_{C(i)}^\top)^{-1} = I_{C(i)} \Lambda^{-1} I_{C(i)}^\top$. Combination with Lemma 3, $AB = BA$ for diagonal matrices, and the fact that $I_{C(i)}^\top I_{C(i)}$ is diagonal and idempotent yields

$$\sum_{i \in \mathcal{R}} I_{C(i)}^\top \Lambda_i^{-1} I_{C(i)} = \sum_{i \in \mathcal{R}} I_{C(i)}^\top I_{C(i)} \Lambda^{-1} I_{C(i)}^\top I_{C(i)} = \sum_{i \in \mathcal{R}} I_{C(i)}^\top I_{C(i)} \Lambda^{-1} = \Lambda \Lambda^{-1} = I. \qquad \square$$

The main idea in decentralized conjugate gradient is to exploit sparsity in (4.19a)–(4.19e) by the mappings $I_{C(i)}$, which we already used for d-ADMM. Let us start by defining—similar to d-ADMM—local equivalents for all involved variables $\bar{\lambda}$, $r$ and $p$:

$$\lambda_i := I_{C(i)} \bar{\lambda}, \qquad r_i := I_{C(i)} r, \qquad \text{and} \qquad p_i := I_{C(i)} p. \tag{4.20}$$

With these definitions, we decompose (4.19a). For doing so, we write $\alpha^n = \eta^n / \sigma^n$ with $\eta^n := r^{n\top} r^n$ and $\sigma^n := p^{n\top} \tilde{S} p^n$. The idea is to express these quantities in terms of the local variables $r_i$ and $p_i$ instead of $r$ and $p$. By Lemma 4, we write $\eta^n$ as

$$r^{n\top}r^n = r^{n\top}I\;r^n = r^{n\top} \left(\sum\_{i\in\mathcal{R}} I\_{C(i)}^\top \Lambda\_i^{-1} I\_{C(i)}\right)r^n.$$

Defining $\eta_i^n := r_i^{n\top} \Lambda_i^{-1} r_i^n$ and using (4.20), we obtain

$$\eta^n = r^{n\top} r^n = \sum_{i \in \mathcal{R}} r_i^{n\top} \Lambda_i^{-1} r_i^n = \sum_{i \in \mathcal{R}} \eta_i^n,$$

where the $\eta_i^n$ can be computed locally. However, note that the sum $\sum_{i \in \mathcal{R}} \eta_i^n$ requires *global communication* of one scalar per subsystem. This yields step 3) (a) and step 4) of Algorithm 7. Next, we decompose the denominator $\sigma^n := p^{n\top} \tilde{S} p^n$ in (4.19a). By the definition of $p_i$ and $\tilde{S}$, and by Lemma 3, the denominator can be written as

$$\sigma^n = p^{n\top} \sum_{i \in \mathcal{R}} \tilde{S}_i p^n = p^{n\top} \sum_{i \in \mathcal{R}} I_{C(i)}^\top I_{C(i)} \tilde{S}_i I_{C(i)}^\top I_{C(i)} p^n = \sum_{i \in \mathcal{R}} p_i^{n\top} \hat{S}_i p_i^n = \sum_{i \in \mathcal{R}} \sigma_i^n.$$

This yields step 1) and step 5) (c) of Algorithm 7. Let us consider (4.19b) and (4.19e). They are entirely local steps, since multiplying both equations by $I_{C(i)}$ from the left yields

$$\bar{\lambda}_i^{n+1} = \bar{\lambda}_i^n + \frac{\eta^n}{\sigma^n} p_i^n \quad \text{and} \quad p_i^{n+1} = r_i^{n+1} + \frac{\eta^{n+1}}{\eta^n} p_i^n,$$

**Algorithm 7.** d-CG for problem (4.12)

**Initialization:** $\bar{\lambda}_i^0$ and $p_i^0 = r_i^0 = I_{C(i)}(\tilde{s} - \tilde{S}\bar{\lambda}^0)$

**Repeat for all** $i \in \mathcal{R}$:

1) $\sigma^n = \sum_{i \in \mathcal{R}} \sigma_i^n$ (scalar global sum)

2) $r_i^{n+1} = r_i^n - \frac{\eta^n}{\sigma^n} \sum_{j \in N(i)} I_{ij} u_j^n$ (neighbor-to-neighbor)

3) (a) $\eta_i^{n+1} = r_i^{n+1\top} \Lambda_i^{-1} r_i^{n+1}$ (b) $\bar{\lambda}_i^{n+1} = \bar{\lambda}_i^n + \frac{\eta^n}{\sigma^n} p_i^n$ (local)

4) $\eta^{n+1} = \sum_{i \in \mathcal{R}} \eta_i^{n+1}$ (scalar global sum)

5) (a) $p_i^{n+1} = r_i^{n+1} + \frac{\eta^{n+1}}{\eta^n} p_i^n$ (b) $u_i^{n+1} = \hat{S}_i p_i^{n+1}$ (c) $\sigma_i^{n+1} = p_i^{n+1\top} \hat{S}_i p_i^{n+1}$ (local)

comprising step 5) (a) and step 3) (b) of Algorithm 7. In a last step, we need to decompose (4.19c), which requires neighbor-to-neighbor communication due to the product $\tilde{S} p^n$, as we shall see. Again, by Lemma 3 and by the definitions of $p_j$ and $\hat{S}_j$, we have

$$r^{n+1} = r^n - \frac{\eta^n}{\sigma^n} \tilde{S} p^n = r^n - \frac{\eta^n}{\sigma^n} \sum\_{j \in \mathcal{R}} \tilde{S}\_j p^n = r^n - \frac{\eta^n}{\sigma^n} \sum\_{j \in \mathcal{R}} I\_{C(j)}^\top \hat{S}\_j p\_j^n.$$

Multiplying by $I_{C(i)}$ from the left yields

$$r\_i^{n+1} = r\_i^n - \frac{\eta^n}{\sigma^n} \sum\_{j \in N(i)} I\_{ij} \hat{S}\_j p\_j^n = r\_i^n - \frac{\eta^n}{\sigma^n} \sum\_{j \in N(i)} I\_{ij} u\_j^n,$$

where we recall that $I_{ij} := I_{C(i)} I_{C(j)}^\top$ and define $u_j^n := \hat{S}_j p_j^n$. Note that we only sum over the neighbors of subsystem $i$, as $I_{ij} = 0$ if $j \notin N(i)$. Moreover, the quantities $u_j^n = \hat{S}_j p_j^n$ can again be computed locally by each subsystem. This yields step 2) and step 5) (b) of Algorithm 7.
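Putting the pieces together, Algorithm 7 can be sketched as follows. All data are hypothetical—two subsystems with random $\hat S_i \succ 0$—and the Python sums stand in for the scalar all-to-all communication and the neighbor-to-neighbor exchanges. After at most $n$ iterations, the local iterates $\bar\lambda_i$ match the restriction of the centralized solution $\tilde S^{-1}\tilde s$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
C = {1: [0, 1, 2], 2: [1, 2, 3]}                     # hypothetical index sets
P = {i: np.eye(n)[idx, :] for i, idx in C.items()}   # selectors I_C(i)

S_hat, s_hat = {}, {}
for i in C:
    m = len(C[i])
    M = rng.standard_normal((m, m))
    S_hat[i] = M @ M.T + m * np.eye(m)               # stand-in for S-hat_i
    s_hat[i] = rng.standard_normal(m)

S_t = sum(P[i].T @ S_hat[i] @ P[i] for i in C)       # S-tilde
s_t = sum(P[i].T @ s_hat[i] for i in C)              # s-tilde
lam_star = np.linalg.solve(S_t, s_t)

Lam_i = {i: P[i] @ sum(P[j].T @ P[j] for j in C) @ P[i].T for i in C}
I_cpl = {(i, j): P[i] @ P[j].T for i in C for j in C}  # coupling matrices I_ij

# consistent initialization (4.21) with lambda-bar^0 = 0
lbar = {i: np.zeros(len(C[i])) for i in C}
r = {i: sum(I_cpl[i, j] @ s_hat[j] for j in C) for i in C}
p = {i: r[i].copy() for i in C}
u = {i: S_hat[i] @ p[i] for i in C}
eta = sum(r[i] @ np.linalg.solve(Lam_i[i], r[i]) for i in C)  # eta^0 (global sum)

for _ in range(n):
    sigma = sum(p[i] @ S_hat[i] @ p[i] for i in C)   # step 1 (scalar global sum)
    for i in C:                                      # step 2 (neighbor-to-neighbor)
        r[i] = r[i] - eta / sigma * sum(I_cpl[i, j] @ u[j] for j in C)
    eta_new = sum(r[i] @ np.linalg.solve(Lam_i[i], r[i]) for i in C)  # 3a) + 4)
    for i in C:
        lbar[i] = lbar[i] + eta / sigma * p[i]       # step 3 (b), local
        p[i] = r[i] + eta_new / eta * p[i]           # step 5 (a), local
        u[i] = S_hat[i] @ p[i]                       # step 5 (b), local
    eta = eta_new

for i in C:
    assert np.allclose(lbar[i], P[i] @ lam_star, atol=1e-8)
print("d-CG matches the centralized solution")
```

Observe that, apart from the two scalar sums, only the vectors $u_j$ are exchanged, mirroring the communication structure in Figure 4.5.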

*Remark 12 (Decentralized operation via hopping)* Note that although global communication is required for computing $\eta$ and $\sigma$, this communication may also be executed in a decentralized fashion via "hopping". By that we mean that each subsystem adds its local scalar to the partial sum received from a neighbor until each subsystem has been reached. In a second "hopping" round, the sum can then be "broadcasted" to all subsystems. In general, such an approach is somewhat problematic, as it requires all subsystems to wait until a hopping round has been completed, which can potentially take a long time in case of bad local communication links, as these delays sum up. However, as in d-CG we only have *one scalar* per subsystem independently of the problem size, this approach might be justified.

## **4.2.4 Consistent initialization**

Note that d-CG requires a consistent initialization satisfying $p^0 = r^0 = \tilde{s} - \tilde{S}\bar{\lambda}^0$. Multiplying this equation by $I_{C(i)}$ from the left and using $I_{C(j)}^\top I_{C(j)} \tilde{S}_j = \tilde{S}_j$ yields

$$r_i^0 = p_i^0 = \sum_{j \in \mathcal{R}} I_{C(i)} \tilde{s}_j - \sum_{j \in \mathcal{R}} I_{C(i)} \tilde{S}_j \bar{\lambda}^0 = \sum_{j \in \mathcal{R}} I_{ij} \hat{s}_j - \sum_{j \in \mathcal{R}} I_{ij} \hat{S}_j \lambda_j^0. \tag{4.21}$$

Thus, when using a zero initialization $\bar{\lambda}^0 = 0$, one needs one extra neighbor-to-neighbor step for the initialization of d-CG. When using warm starting—i.e. using the solution of the previous outer bi-level ALADIN iteration as an initial guess $\bar{\lambda}^0 \neq 0$—one needs one additional neighbor-to-neighbor step for computing the second term in (4.21). Note that the vectors $\lambda_j^0$ are all known locally from the previous outer bi-level ALADIN iteration. For d-ADMM such a step is not required, as $\hat{s}_i$ is explicitly considered in step 1) of Algorithm 6. For warm starting, one can use $\bar{\lambda}_i^0$ directly from the previous iteration.

## **4.3 Comparing d-ADMM and d-CG**

Next, we briefly compare important properties of d-ADMM and d-CG.

## **4.3.1 Information exchange**

The required information exchange and the locally maintained variables for d-ADMM (Algorithm 6) and d-CG (Algorithm 7) are graphically illustrated in Figure 4.5. Both algorithms require the same amount of information exchange between neighbors: whereas d-ADMM exchanges the matrix-vector product

Figure 4.5: Information exchange and local variables for d-ADMM and d-CG.


Table 4.1: Properties of d-CG and d-ADMM for problem (4.12).

$I_{ij}\lambda_j$, d-CG exchanges the matrix-vector product $I_{ij}u_j$. A key difference in d-CG is the additionally required global communication of the two scalars $\eta$ and $\sigma$ per iteration. Moreover, for d-CG, one has to maintain four variables per node, whereas d-ADMM requires maintaining two variables per node.

## **4.3.2 Convergence properties**

d-ADMM and d-CG exhibit different convergence properties. The conjugate gradient algorithm (theoretically) yields the *exact* solution in at most as many steps as the dimension of (4.11) [NW06], which is a very strong guarantee. However, in practice convergence is typically slower, as conjugate gradient is quite sensitive to numerical round-off errors coming from finite-precision arithmetic. Nonetheless, one can observe superlinear convergence [BK01]. The convergence rate of ADMM is either sublinear for convex problems or linear for strongly convex problems, cf. Chapter 3. A further advantage of d-CG compared with d-ADMM is that no tuning of the step size is needed, as this is done "automatically" in step 1) and step 4) of Algorithm 7. The properties of d-CG and d-ADMM are summarized in Table 4.1.

*Remark 13 (Different version of d-ADMM in [Eng+20])* In [Eng+20] we used a different version of d-ADMM. A major difference is that in the present version, redundant variables in (4.12) are eliminated and thus all $f_i$ are strongly convex. Hence, in the present case, we obtain linear convergence (cf. [Yan+16; WO13; NOR18]). Moreover, note that the derivation of d-ADMM and d-CG in [Eng+20] is different, as we do not use the matrices $I_{ij}$ there for encoding sparse communication.

## **4.4 Bi-level distributed ALADIN in summary**

The overall *bi-level distributed ALADIN* algorithm is summarized in Algorithm 8. Observe the similarity to basic ALADIN in Algorithm 5. There are two main differences between these two algorithms: First, instead of communicating full sensitivities such as gradients, Hessians and Jacobians in step 1) of Algorithm 5, one computes the *Schur complements* $\tilde{S}_i$ and $\tilde{s}_i$ by (4.10), reducing the amount of required communication compared to Algorithm 5. Moreover, in step 2), the condensed coordination step is computed in a decentralized fashion, leading to an overall *decentralized algorithm*. The residual $\|r_q^k\|$ quantifies the "amount of inexactness" in the coordination step due to decentralized conjugate gradients and decentralized ADMM and will be defined in the next subsection.

## **4.5 Convergence analysis**

Recall that d-CG and d-ADMM solve the coordination QP (4.1) to a *finite precision only*. Thus, the convergence analysis from [HFD16] cannot directly be used for bi-level ALADIN. In the following we show that the favorable convergence properties of ALADIN can be preserved if the error in the inexact solution to (4.1) and the error due to the Hessian approximation decay fast enough. For doing so, we use techniques from inexact Newton methods [DES82; HFD16]. Our analysis contains the analysis of basic ALADIN as a special case. The overall proof is outlined in Figure 4.6. The main idea is to

### Algorithm 8. Bi-level distributed ALADIN

Initialization: Initial guess $(z^0, \lambda^0)$, parameters $\Sigma_i \succ 0$, $\rho, \nu > 0$. Repeat:

1) Solve in *parallel* for all ∈ R

$$\mathbf{x}\_{i}^{k} = \underset{\mathbf{x}\_{i} \in \mathcal{X}\_{i}}{\text{argmin}} \quad f\_{i}(\mathbf{x}\_{i}) + (\boldsymbol{\lambda}^{k})^{\top} \boldsymbol{A}\_{i} \mathbf{x}\_{i} + \frac{\nu}{2} \left\| \mathbf{x}\_{i} - \boldsymbol{z}\_{i}^{k} \right\|\_{\Sigma\_{i}}^{2} \tag{4.22}$$

and compute *condensed* sensitivities $\tilde{S}_i$ and $\tilde{s}_i$.


combine the property that the local problems in step 1) of ALADIN form a Lipschitz continuous map in terms of $(z^k, \lambda^k)$ with the observation that step 2) can be viewed as an SQP step, making results from Newton-type methods for fast local contraction applicable (cf. Section 2.1.2).

#### **Lipschitzness of the parallel step**

Let us start by analyzing the Lipschitz property of the parallel step 1) in basic ALADIN (Algorithm 5). This step is important mainly for two reasons: It minimizes the local NLPs while at the same time staying close to the previous iterate $z^k$, where the distance to $z^k$ can be controlled by $\nu$. The hope is that the minimizer of the NLP brings us closer to a local minimizer (this is not proven theoretically so far, but it often does in practice). The second and maybe even more important feature of this step is that it determines an active set $\mathcal{A}(x^k)$ for which the linearizations of the inequality constraints are held constant in the coordination QP. This renders the coordination QP an *equality-constrained* QP instead of a mixed equality-inequality constrained QP, which is substantially cheaper to solve.

Next, we use standard results from parametric programming to show that the mapping formed by step 1) of basic ALADIN is Lipschitz continuous with respect to $(z^k, \lambda^k)$. We will then use this result to show overall fast linear

Figure 4.6: Outline of the bi-level ALADIN proof.

to quadratic local convergence of bi-level ALADIN. The theorem is originally from [Rob80, Thm 4.1], but stated here in the form of [FI90, Thm 5.2].

*Theorem 5 (Lipschitz property for parametric programming)* Consider the parametric programming problem

$$\min\_{\mathbf{x}\in\mathbb{R}^n} f(\mathbf{x}, p) \quad \text{subject to} \ g(\mathbf{x}, p) = 0 \quad \text{and} \ h(\mathbf{x}, p) \le 0. \tag{4.23}$$

Let LICQ (Definition 2) and the conditions of Theorem 2 (SOSC) hold at $(x^\star, \bar{p})$ with $\kappa^\star > 0$ (strict complementarity), where $x^\star$ is the solution to (4.23) for given $p = \bar{p}$. Then there exists a $\chi < \infty$ such that in a neighborhood of $\bar{p}$ it holds that

$$\|x - x^\star\| \le \chi \|p - \bar{p}\|. \tag{4.24}$$

By defining $p^\top := (z^\top, \lambda^\top)$, it is easy to see that the local problems in step 1) of Algorithm 5 are of the form (4.23). Thus, if we are able to ensure that LICQ, SOSC and strict complementarity hold at $x_i^\star$, we can use (4.24) in our local convergence analysis. Hence, we have to ensure that the Hessians of the Lagrangians are sufficiently positive definite such that SOSC holds. This can be ensured by choosing $\nu$ and $\Sigma_i$ large enough/sufficiently positive definite such that locally

$$
\nabla\_{\times \times}^2 \left( f\_i(\mathbf{x}\_i^k) + \kappa\_i^{k \top} h\_i(\mathbf{x}\_i^k) \right) + \nu \Sigma\_i \succ 0 \tag{4.25}
$$

is satisfied. This corresponds to the statement in [HFD16, Lemma 3].

#### **Coordination is a Newton-type step**

Next, we analyze the progress towards a local minimizer in the coordination QP (4.1). We evaluate the first-order KKT conditions (2.2) for the coordination QP, which form a linear system of equations. Then we compare this linear system to the KKT system of a Newton-type step applied to the KKT conditions of problem (2.12) to highlight their similarity. This makes convergence results for Newton-type methods (Theorem 4) applicable, yielding local quadratic convergence of ALADIN.

Let us start by evaluating the KKT conditions of the coordination QP (4.1). Note that as $B^k \succ 0$ on the nullspace of $C^k$, the coordination QP (4.1) is strongly convex and thus the KKT conditions are necessary and sufficient for solving (2.25) (cf. Section 2.1.1). The Lagrangian of (4.1) is

$$\begin{split} \mathcal{L}(\Delta \mathbf{x}, \mathbf{s}, \tilde{\kappa}, \lambda) &= \frac{1}{2} \Delta \mathbf{x}^{\top} \mathbf{B}^{k} \Delta \mathbf{x} + \nabla f(\mathbf{x}^{k})^{\top} \Delta \mathbf{x} + \lambda^{k\top} \mathbf{s} + \frac{\rho}{2} ||\mathbf{s}||\_{2}^{2} \\ &+ \tilde{\kappa}^{\top} \mathbf{C}^{k} \Delta \mathbf{x} + \lambda^{\top} \left( A \left( \mathbf{x}^{k} + \Delta \mathbf{x} \right) - b - s \right), \end{split}$$

where $\tilde{\kappa}^\top = (\tilde{\kappa}_1^\top, \dots, \tilde{\kappa}_R^\top)$ are the Lagrange multipliers associated with (4.1b). Here we use concatenated notation for all variables, i.e. $\Delta x^\top = (\Delta x_1^\top, \dots, \Delta x_R^\top)$, $A = (A_1, \dots, A_R)$, $B^k := \operatorname{diag}_{i \in \mathcal{R}}(B_i^k)$, $C^k := \operatorname{diag}_{i \in \mathcal{R}}(C_i^k)$ and $\nabla f(x^k)^\top = (\nabla f_1(x_1^k)^\top, \dots, \nabla f_R(x_R^k)^\top)$. Thus, the KKT conditions read

$$\nabla\_{\Delta \mathbf{x}} \mathcal{L}(\Delta \mathbf{x}, \mathbf{s}, \tilde{\kappa}, \lambda) = B^k \Delta \mathbf{x} + \nabla f(\mathbf{x}^k) + A^\top \lambda + C^{k\top} \tilde{\kappa} \qquad = \mathbf{0}, \qquad (4.26a)$$

$$\nabla\_s \mathcal{L}(\Delta \mathbf{x}, \mathbf{s}, \tilde{\kappa}, \lambda) = \lambda^k + \rho \mathbf{s} - \lambda \qquad \qquad = \mathbf{0}, \qquad (4.26 \mathbf{b})$$

$$\nabla_{\tilde{\kappa}} \mathcal{L}(\Delta x, s, \tilde{\kappa}, \lambda) = C^k \Delta x \qquad\qquad\qquad = 0, \qquad (4.26\text{c})$$

$$\nabla\_{\lambda} \mathcal{L}(\Delta \mathbf{x}, \mathbf{s}, \tilde{\kappa}, \lambda) = A \left( \mathbf{x}^{k} + \Delta \mathbf{x} \right) - b - \mathbf{s} \qquad \qquad = \mathbf{0}. \tag{4.26d}$$

Rearranging (4.26b) yields $s = \frac{1}{\rho}(\lambda - \lambda^k)$. Eliminating $s$ in (4.26d), defining $\Delta\lambda := \lambda - \lambda^k$, and expanding (4.26a) with $\pm A^\top \lambda^k$ and $\pm C^{k\top} \kappa^k$ reveals that

$$\begin{aligned} \nabla_{\Delta x} \mathcal{L}(\Delta x, s, \tilde{\kappa}, \lambda) &= B^k \Delta x + \nabla f(x^k) + A^\top \Delta\lambda + A^\top \lambda^k + C^{k\top} \Delta\kappa + C^{k\top} \kappa^k &= 0, \\ \nabla_{\tilde{\kappa}} \mathcal{L}(\Delta x, s, \tilde{\kappa}, \lambda) &= C^k \Delta x &= 0, \end{aligned}$$

$$\nabla\_{\lambda} \mathcal{L}(\Delta \mathbf{x}, \mathbf{s}, \tilde{\kappa}, \lambda) = A \left( \mathbf{x}^{k} + \Delta \mathbf{x} \right) - b - \rho^{-1} \Delta \lambda \qquad \qquad = \mathbf{0}.$$

with $\Delta\kappa = \tilde{\kappa} - \kappa^k$. Thus,

$$\underbrace{\begin{pmatrix} B^k & A^\top & C^{k\top} \\ A & -\frac{1}{\rho}I & 0 \\ C^k & 0 & 0 \end{pmatrix}}_{=:M(p^k)} \begin{pmatrix} \Delta x \\ \Delta\lambda \\ \Delta\kappa \end{pmatrix} = \underbrace{\begin{pmatrix} -\nabla f(x^k) - A^\top \lambda^k - C^{k\top} \kappa^k \\ -Ax^k + b \\ 0 \end{pmatrix}}_{=:m(p^k, \lambda^k)}. \tag{4.28}$$

The dependence of $M$ on $p^k$ stems from the dependence of $\nabla^2_{xx} \mathcal{L}(x^k, \kappa^k, \lambda^k)$ on $(x^k, \kappa^k)$ (in case of exact Hessians).
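The elimination of the slack $s$ can be checked numerically. The sketch below uses random stand-ins for $B^k$, $A$, $C^k$ and the remaining data, solves the condensed system (4.28), and verifies that the recovered $(\Delta x, s, \tilde\kappa, \lambda)$ satisfies the original conditions (4.26a)–(4.26d).

```python
import numpy as np

rng = np.random.default_rng(3)
nx, nlam, nkap, rho = 6, 3, 2, 10.0

M0 = rng.standard_normal((nx, nx))
B = M0 @ M0.T + nx * np.eye(nx)       # stand-in for B^k (SPD)
A = rng.standard_normal((nlam, nx))
Ck = rng.standard_normal((nkap, nx))
grad_f = rng.standard_normal(nx)
xk = rng.standard_normal(nx)
lam_k = rng.standard_normal(nlam)
kap_k = rng.standard_normal(nkap)
b = rng.standard_normal(nlam)

# condensed KKT system (4.28)
KKT = np.block([[B, A.T, Ck.T],
                [A, -1.0 / rho * np.eye(nlam), np.zeros((nlam, nkap))],
                [Ck, np.zeros((nkap, nlam)), np.zeros((nkap, nkap))]])
rhs = np.concatenate([-grad_f - A.T @ lam_k - Ck.T @ kap_k,
                      -A @ xk + b, np.zeros(nkap)])
dx, dlam, dkap = np.split(np.linalg.solve(KKT, rhs), [nx, nx + nlam])

# recover the variables of (4.26)
lam = lam_k + dlam
kap = kap_k + dkap
s = (lam - lam_k) / rho               # from (4.26b)

# verify (4.26a)-(4.26d)
assert np.allclose(B @ dx + grad_f + A.T @ lam + Ck.T @ kap, 0)  # (4.26a)
assert np.allclose(lam_k + rho * s - lam, 0)                     # (4.26b)
assert np.allclose(Ck @ dx, 0)                                   # (4.26c)
assert np.allclose(A @ (xk + dx) - b - s, 0)                     # (4.26d)
print("condensed system reproduces the full KKT conditions")
```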

### **Similarity to an SQP step for problem** (2.12)

The Lagrangian of the affinely-coupled separable problem (2.12) reads

$$
\mathcal{L}(\mathbf{x}, \kappa\_E, \kappa\_I, \lambda) = f(\mathbf{x}) + \kappa\_E^\top \mathbf{g}(\mathbf{x}) + \kappa\_I^\top h(\mathbf{x}) + \lambda^\top (A\mathbf{x} - b),
$$

where $f = \sum_{i \in \mathcal{R}} f_i$, $x^\top = (x_1^\top, \dots, x_R^\top)$, $\kappa_E^\top = (\kappa_{E,1}^\top, \dots, \kappa_{E,R}^\top)$, $\kappa_I^\top = (\kappa_{I,1}^\top, \dots, \kappa_{I,R}^\top)$, $g^\top = (g_1^\top, \dots, g_R^\top)$, $h^\top = (h_1^\top, \dots, h_R^\top)$ and $A = (A_1, \dots, A_R)$. Then, the KKT conditions read

$$\begin{aligned} F(q^\star) &:= \nabla \mathcal{L}(\boldsymbol{x}^\star, \kappa\_E^\star, \kappa\_I^\star, \boldsymbol{\lambda}^\star) \\ &= \begin{pmatrix} \nabla f(\boldsymbol{x}^\star) + \nabla g(\boldsymbol{x}^\star)^\top \kappa\_E^\star + \nabla h(\boldsymbol{x}^\star)^\top \kappa\_I^\star + \boldsymbol{A}^\top \boldsymbol{\lambda}^\star \\ \boldsymbol{A} \boldsymbol{x}^\star - \boldsymbol{b} \\ \boldsymbol{g}(\boldsymbol{x}^\star) \\ (h(\boldsymbol{x}^\star))\_{\mathcal{H}(\boldsymbol{x}^\star)} \end{pmatrix} = \boldsymbol{0}, \end{aligned} (4.29)$$

where $q^\top := (x^\top, \lambda^\top, \kappa^\top)$ with $\kappa^\top := (\kappa_E^\top, \kappa_I^\top)$. An exact Newton step applied to (4.29), $\nabla F(q^k) \Delta q = -F(q^k)$, yields

$$\begin{aligned} &\begin{pmatrix} \nabla^2_{xx}\mathcal{L}(x^k, \kappa_E^k, \kappa_I^k, \lambda^k) & A^\top & \nabla g(x^k)^\top & \nabla\left(h(x^k)\right)^\top_{\mathcal{H}(x^\star)} \\ A & 0 & 0 & 0 \\ \nabla g(x^k) & 0 & 0 & 0 \\ \nabla\left(h(x^k)\right)_{\mathcal{H}(x^\star)} & 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta\lambda \\ \Delta\kappa_E \\ \Delta\kappa_I \end{pmatrix} \\ &\qquad = \begin{pmatrix} -\nabla f(x^k) - \nabla g(x^k)^\top \kappa_E^k - \nabla h(x^k)^\top \kappa_I^k - A^\top \lambda^k \\ -Ax^k + b \\ -g(x^k) \\ -\left(h(x^k)\right)_{\mathcal{H}(x^\star)} \end{pmatrix}. \end{aligned} \tag{4.30}$$

By comparing (4.30) with (4.28) one can see that for exact Hessians $B^k = \nabla^2_{xx} \mathcal{L}$ and Jacobians $C^{k\top} = \left(\nabla g(x^k)^\top \;\; \nabla\left(h(x^k)\right)^\top_{\mathcal{H}(x^\star)}\right)$, both equations coincide apart from the term $-\frac{1}{\rho}I$. For $\rho \to \infty$ both equations coincide exactly. Hence, the coordination step of ALADIN can be interpreted as an *inexact Newton step* applied to (2.12). Note that Theorem 4 allows for such an "inexactness" in the compatibility condition (2.10a), but the convergence rate depends on whether this inexactness vanishes asymptotically.

#### **Overall convergence under inexact coordination**

Now let us use Theorem 4 to prove local convergence of ALADIN. For doing so, we have to check whether the conditions of Theorem 4 are satisfied. The Lipschitz condition (2.10b) holds by assuming Lipschitzness of the gradients and Hessians/Jacobians of $f$, $g$ and $h$ in (2.12) and by assuming that LICQ and SOSC are satisfied (to exclude the case that the KKT matrix is rank-deficient at $q^\star$). For the compatibility condition (2.10a), let us assume that we use exact Hessians $B^k = \nabla^2_{xx} \mathcal{L}(x^k, \kappa^k, \lambda^k)$. Then, the compatibility condition reads (using the left-hand sides of (4.30) and (4.28))

$$\gamma^k := \left\| \left(M^k(\rho^k)\right)^{-1} \left(M^k(\rho^k) - \nabla F(q^k)\right) \right\| \le \left\| \left(M^k(\rho^k)\right)^{-1} \right\| \left\| \tfrac{1}{\rho^k} I \right\|.$$

Since $M^k(\rho^k) \to \nabla F(q^k)$ for $\rho^k \to \infty$, $\|(M^k(\rho^k))^{-1}\|$ converges to a fixed number and $1/\rho^k$ converges to zero, which means that we can make $\gamma^k$ arbitrarily small (and especially smaller than one), as required by Theorem 4. Thus, the conditions of Theorem 4 are satisfied<sup>3</sup> and we can use the contraction inequality (2.9) in our following convergence analysis.
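The decay of $\gamma^k$ with $\rho$ can be illustrated numerically. The sketch below uses random stand-ins for the KKT blocks and omits the inequality-constraint block $C^k$ for brevity; it computes $\gamma(\rho) = \|M(\rho)^{-1}(M(\rho) - \nabla F)\|$ for increasing penalties and shows that it shrinks roughly like $1/\rho$.

```python
import numpy as np

rng = np.random.default_rng(4)
nx, nlam = 5, 2

M0 = rng.standard_normal((nx, nx))
B = M0 @ M0.T + nx * np.eye(nx)       # exact-Hessian stand-in (SPD)
A = rng.standard_normal((nlam, nx))

# nabla F: exact KKT matrix; M(rho): same matrix with the -1/rho * I block
nabla_F = np.block([[B, A.T], [A, np.zeros((nlam, nlam))]])

def gamma(rho):
    M_rho = np.block([[B, A.T], [A, -1.0 / rho * np.eye(nlam)]])
    return np.linalg.norm(np.linalg.solve(M_rho, M_rho - nabla_F), 2)

gammas = [gamma(rho) for rho in (1e1, 1e2, 1e3, 1e4)]
# gamma decreases monotonically, roughly by a factor 10 per decade of rho
assert all(g_next < g for g, g_next in zip(gammas, gammas[1:]))
assert gammas[-1] < 1.0
print("gamma(rho):", gammas)
```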

Apart from the inexactness introduced by the term $\frac{1}{\rho}I$ from before, we have a second kind of inexactness: the inexactness coming from an inexact step computation by the aforementioned two inner algorithms d-ADMM and d-CG. Thus, we include this second source of inexactness in our convergence analysis. For doing so, we use ideas from inexact Newton methods [DES82]. Let us denote the primal-dual iterate after the coordination step 3) of basic ALADIN (Algorithm 5) by $q^{k\top} = (x^{k\top}, \tilde{\kappa}^{k\top}, \lambda^{k\top})$ and let us denote an *inexact primal-dual iterate* of step 3) by $\bar{q}^k$.

We analyze the effect of the inexactness. The distance of the inexact iterate $\bar{q}^{k+1}$ to a local primal-dual solution $q^\star$ reads

$$\begin{aligned} \left\| \bar{q}^{k+1} - q^\star \right\| \leq \left\| \bar{q}^{k+1} - q^{k+1} \right\| + \left\| q^{k+1} - q^\star \right\| \leq \\ \left\| \bar{q}^{k+1} - q^{k+1} \right\| + \frac{\omega}{2} \left\| p^k - p^\star \right\|^2 + \gamma \left\| p^k - p^\star \right\|, \quad (4.31) \end{aligned}$$

where we expanded with $\pm q^{k+1}$, used the triangle inequality and the Newton-type contraction from Theorem 4, and where we denote the primal-dual iterate after step 1) of basic ALADIN by $p^{k\top} = (x^{k\top}, \kappa^{k\top}, \lambda^{k\top})$. In order to preserve convergence, we need to bound $\|\bar{q}^{k+1} - q^{k+1}\|$ in (4.31), representing the "inexactness"

<sup>3</sup> We consider local convergence here. Thus, the condition $\|p^0 - p^\star\| < \frac{2(1-\gamma)}{\omega}$ has to be satisfied by a suitable primal-dual initialization.

of the solution to (4.1). In order to do so, let us define the residual of the KKT system (4.28) as

$$r_q^k := M(p^k) \left(\bar{q}^{k+1} - p^k\right) - m(p^k, \lambda^k). \tag{4.32}$$

Since ( ) ( +<sup>1</sup> − +1 ) − ( , ) = 0 by definition of +1 , we have

$$\begin{aligned} \|\bar{q}^{k+1} - q^{k+1}\| &= \| (M(p^k))^{-1} (r\_q^k + m(p^k, \lambda^k) - m(p^k, \lambda^k)) \| \\ &\le \| (M(p^k))^{-1} \| \| r\_q^k \| . \end{aligned}$$

Combination with (4.31) and defining $\beta := \|(M(p^k))^{-1}\|$ yields

$$\left\| \bar{q}^{k+1} - q^{\star} \right\| \le \beta \|r_q^k\| + \frac{\omega}{2} \|p^k - p^{\star}\|^2 + \gamma \left\| p^k - p^{\star} \right\|.$$

By assuming LICQ, strict complementarity and SOSC, we know by Theorem 5 that the mapping formed by the local problems in ALADIN is Lipschitz with constant $\chi$. Thus, we get

$$\left\| \bar{q}^{k+1} - q^{\star} \right\| \le \beta \left\| r_q^k \right\| + \gamma^k \chi \left\| \bar{q}^k - q^{\star} \right\| + \frac{\omega}{2} \chi^2 \left\| \bar{q}^k - q^{\star} \right\|^2.$$

Enforcing a bound on the inaccuracy of the solution to the coordination QP stemming from the inner algorithms,

$$\|r_q^k\| \le \eta^k \|m(p^k, \lambda^k)\| \tag{4.33}$$

yields

$$\left\|\bar{q}^{k+1} - q^{\star}\right\| \le \left(\beta \,\delta\, \eta^{k} + \gamma^{k}\right) \chi \left\|\bar{q}^{k} - q^{\star}\right\| + \frac{\omega}{2}\, \chi^{2} \left\|\bar{q}^{k} - q^{\star}\right\|^{2}, \qquad (4.34)$$

where $\delta$ is a Lipschitz constant of $m$.

Note that if the sequences $\{\gamma^k\}$ and $\{\eta^k\}$ are bounded by constants $\bar{\gamma}$, $\bar{\eta}$ small enough such that $\|\bar{q}^k - q^\star\| < 2(1 - \bar{\gamma}\chi - \beta\delta\bar{\eta}\chi)(\omega\chi^2)^{-1}$ is satisfied in each iteration, linear convergence of ALADIN is guaranteed. By additionally enforcing decay of the sequences $\{\gamma^k\}$ and $\{\eta^k\}$, i.e. $\gamma^k = O(\|\bar{q}^k - q^\star\|)$ and $\eta^k = O(\|\bar{q}^k - q^\star\|)$, we can make the convergence quadratic. Alternatively, enforcing $\gamma^k, \eta^k \to 0$ yields superlinear convergence.
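The role of the forcing sequence $\{\eta^k\}$ can be illustrated with a generic inexact Newton iteration (a hypothetical toy system, not the ALADIN coordination step itself): the Newton step for a small nonlinear system is perturbed by a relative error of size $\eta^k$, and the iteration still converges as long as $\eta^k$ stays small, with faster contraction when $\eta^k$ decays with the residual.

```python
import numpy as np

def F(q):  # toy nonlinear system with root (1, 0); hypothetical example
    return np.array([q[0]**3 + q[1] - 1.0, q[1]**3 - q[0] + 1.0])

def J(q):  # Jacobian of F; det = 9 q0^2 q1^2 + 1 >= 1, so always invertible
    return np.array([[3 * q[0]**2, 1.0],
                     [-1.0, 3 * q[1]**2]])

def inexact_newton(eta_rule, iters=40):
    q = np.array([0.5, 0.5])
    rng = np.random.default_rng(5)
    for _ in range(iters):
        step = np.linalg.solve(J(q), -F(q))
        # perturb the step by a relative error of size eta, cf. (4.33)
        eta = eta_rule(np.linalg.norm(F(q)))
        noise = rng.standard_normal(2)
        noise *= eta * np.linalg.norm(step) / np.linalg.norm(noise)
        q = q + step + noise
    return q

q_const = inexact_newton(lambda res: 0.1)            # constant forcing term
q_decay = inexact_newton(lambda res: min(0.1, res))  # eta^k = O(||F||), faster

for q in (q_const, q_decay):
    assert np.linalg.norm(F(q)) < 1e-6
print("both forcing-term choices converge")
```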

We summarize our results from this section in the following Theorem.

## *Theorem 6 (Convergence of bi-level distributed ALADIN)*

Consider bi-level distributed ALADIN (Algorithm 8) with $\Sigma_i$ and $\nu$ such that (4.25) holds. Suppose Assumption 2 and the conditions of Theorem 2 hold. Let the solution to the condensed QP (4.9) satisfy (4.33) for a bounded sequence $\{\eta^k\}_{k \in \mathbb{N}} < \bar{\eta}$, and let $q^0$ be close enough to a primal-dual minimizer $q^\star$.

Then bi-level distributed ALADIN converges to $q^\star$.


Theorem 6 follows directly from inequality (4.34) and the definitions of linear and quadratic convergence (cf. Section A.1).

In basic ALADIN, the coordination problem (4.9) is solved exactly and thus $\eta^k \equiv 0$. Thus, basic ALADIN can be seen as a *special case of bi-level ALADIN*. Moreover, the convergence analysis given here can be extended to the case where the local NLPs are solved to finite precision only [Eng+19b], using similar techniques as we did here. We omit this extension as it adds more technical complications without adding much insight. Moreover, the BFGS-ALADIN variant we presented in [Eng+19b] is included in this framework as a particular choice of $B^k$. We do not consider this variant here.

*Remark 14 (KKT matrix modifications)* Note that the compatibility condition (2.10a) and the corresponding requirements on $\{\gamma^k\}$ cover a variety of modifications of the KKT matrix, not only the regularization with $\frac{1}{\rho}I$. Further sources of inexactness are the error due to Hessian approximations such as BFGS, the error due to regularization procedures ensuring positive definiteness, or errors stemming from approximations in the constraint linearizations.

*Remark 15 (Evaluation of the error condition* (4.33)*)* In order to guarantee local convergence, one has to ensure that (4.33) holds for a suitable sequence $\{\eta^k\} \le \bar\eta$. Intuitively this means that the residual $\|r_q^k\|$ has to become smaller as ALADIN proceeds; the accuracy of the coordination step has to increase as we approach a local minimizer, since $m(p^k, \lambda^k) \to 0$ for $p^k \to p^\star$. But evaluating $\|r_q^k\|$ at each iteration is expensive and yields additional overhead, which one would like to avoid. There is an option to overcome this limitation: instead of evaluating $\|r_q^k\|$, one can equivalently evaluate the residual $r_\lambda^k$ of the reduced system (4.9). This can be seen as follows. As we use the nullspace method, the last row of the residual in (4.28) is always zero by construction. Hence, it remains to evaluate the first and second row of (4.28). Again, by using the nullspace parametrization $\Delta x = Z \Delta \nu$ from Section 4.1, one can *enforce* a zero residual in the first row (by multiplying with $Z^\top$ from the left and exploiting $C Z = 0$)

$$
\bar{B}^k \Delta \nu + \bar{g}^k + (\bar{A}^k)^{\top} \bar{\lambda}^{k+1} = 0,
$$

where we recall that $\bar{\lambda}^{k+1}$ denotes the inexact version of $\lambda^{k+1}$. Inserting $\Delta\nu$ into the second row yields the residual

$$r_{\lambda}^{k} = \left(\bar{A}^{k} (\bar{B}^{k})^{-1} (\bar{A}^{k})^{\top} + \frac{1}{\rho} I\right) \bar{\lambda}^{k+1} + \bar{A}^{k} (\bar{B}^{k})^{-1} \bar{g}^{k} - \bar{A}^{k} \nu^{k} + \frac{1}{\rho} \lambda^{k} + b = \tilde{S} \bar{\lambda}^{k+1} - \tilde{s}.$$

With that, we have $r^k = \left(0^\top \; 0^\top \; (r_\lambda^k)^\top\right)^\top$ and thus $\|r^k\| = \|r_\lambda^k\|$. Hence, we can use $r_\lambda^k$ for evaluating (4.33) instead of $r_q^k$.
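The norm identity $\|r^k\| = \|r_\lambda^k\|$ can be checked numerically on a small saddle-point system with the same elimination structure; the data and names below (`B`, `A`, `g`, `c`, `rho`) are synthetic stand-ins, not the thesis' quantities:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, rho = 6, 3, 10.0

# Synthetic two-block KKT-like system:
#   row 1:  B dnu + A^T lam = -g
#   row 2:  A dnu - lam/rho =  c
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)              # SPD "reduced Hessian"
A = rng.standard_normal((m, n))          # coupling Jacobian
g, c = rng.standard_normal(n), rng.standard_normal(m)

# Condensed (Schur-complement) system in lam alone.
S = A @ np.linalg.solve(B, A.T) + np.eye(m) / rho
s = -(c + A @ np.linalg.solve(B, g))

# Inexact dual iterate, e.g. from a truncated inner iteration.
lam_bar = np.linalg.solve(S, s) + 1e-3 * rng.standard_normal(m)

# Enforce row 1 exactly by back-substitution, as in the nullspace method.
dnu = -np.linalg.solve(B, g + A.T @ lam_bar)

r1 = B @ dnu + A.T @ lam_bar + g         # zero by construction
r2 = A @ dnu - lam_bar / rho - c         # the only surviving residual
r_lam = S @ lam_bar - s                  # residual of the condensed system

# The full residual norm equals the condensed residual norm.
assert np.allclose(r1, 0.0)
assert np.isclose(np.linalg.norm(np.concatenate([r1, r2])),
                  np.linalg.norm(r_lam))
```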

## **4.6 Comparison of variants**

Next, we compare the two developed bi-level variants with basic ALADIN, ALADIN with condensing, and ADMM as one of the most prominent primal-dual methods. In the following, we denote bi-level ALADIN with ADMM by b-ADMM and bi-level ALADIN with conjugate gradients by b-CG. Our evaluation mainly focuses on communication effort, coordination effort, convergence guarantees and practical convergence.

#### **Communication**

As reducing communication was one of our main motivations for developing bi-level ALADIN, let us start by analyzing the amount of communication. We analyze communication by counting the number of floating point numbers (floats) which have to be exchanged per iteration.

We begin by analyzing basic ALADIN. Here, we have to communicate all components necessary to construct problem (4.1), i.e. the Hessian approximations, the gradients of the local objective functions $\nabla f_i(x_i^k)$, the Jacobians of the active constraints and the primal variables $x_i^k$. Note that we do not count the communication of $A_i$ and $b$ since they stay constant during the iterations and have to be communicated only once. Each Hessian approximation is a symmetric matrix, requiring to communicate $n_{x_i}(n_{x_i}+1)/2$ floats. The Jacobian calls for communicating $n_{x_i}(n_{g_i} + n_{h_i}^k)$ floats, where $n_{h_i}^k$ is the number of active inequality constraints in outer iteration $k$. Furthermore, the primal iterates and the gradients $\nabla f_i(x_i^k)$ lead to additional $2 n_{x_i}$ floats from each subsystem. In total, this means that we get a forward communication for basic ALADIN of

$$\sum_{i \in \mathcal{R}} n_{x_i} (n_{x_i} + 1)/2 + n_{x_i} (n_{g_i} + n_{h_i}^k) + 2 n_{x_i}$$

floats per ALADIN iteration if we do not exploit structural zeros in any of the aforementioned vectors and matrices. In the backward step 3) of ALADIN (Algorithm 5), $x^{k+1}$ and $\lambda^{k+1}$ have to be communicated, leading to $\left(\sum_{i \in \mathcal{R}} n_{x_i} + n_\lambda\right)$ floats of backward communication.

For the condensed ALADIN variant, all subsystems have to communicate their reduced Schur-complement matrices $\tilde{S}_i \in \mathbb{R}^{|\mathcal{C}(i)| \times |\mathcal{C}(i)|}$ and vectors $\tilde{s}_i \in \mathbb{R}^{|\mathcal{C}(i)|}$. As $\tilde{S}_i$ is symmetric, this leads to a forward communication of

$$\sum\_{i \in \mathcal{R}} |\mathcal{C}(i)| \left( |\mathcal{C}(i)| + 1 \right) / 2 + |\mathcal{C}(i)|.$$

In the backward step, only $\lambda^{k+1}$ has to be communicated, leading to $\sum_{i \in \mathcal{R}} |\mathcal{C}(i)|$ floats of backward communication.

For b-ADMM (Algorithm 6), the matrix-vector product $\tilde{S}_i \lambda_i^{k+1}$ has to be communicated in each inner iteration. As the total number of non-zero entries in $\lambda_i$ is equal to the number of consensus constraints assigned to subsystem $i \in \mathcal{R}$, this amounts to $|\mathcal{C}(i)|$ floats per subsystem and inner ADMM iteration. Thus, we need to communicate $\sum_{i \in \mathcal{R}} |\mathcal{C}(i)|$ floats in each d-ADMM iteration on a neighbor-to-neighbor basis. This yields a total forward communication of $n^{\mathrm{AD}} \sum_{i \in \mathcal{R}} |\mathcal{C}(i)|$ floats per outer ALADIN iteration, where $n^{\mathrm{AD}}$ is the number of d-ADMM iterations. As all $\lambda_i$ are known locally already during the d-ADMM iterations, no backward communication is needed. Furthermore, d-ADMM requires no global communication.

Similar to d-ADMM, in d-CG the matrix-vector products have to be communicated, leading to local communication of $n^{\mathrm{CG}} \sum_{i \in \mathcal{R}} |\mathcal{C}(i)|$ floats, where $n^{\mathrm{CG}}$ is the number of inner d-CG iterations. Additionally, d-CG needs the global communication of two floats per subsystem and per inner d-CG iteration, leading to $2 |\mathcal{R}| n^{\mathrm{CG}}$ floats. Also for d-CG, no backward communication is required as $\lambda_i$ is already known locally.

For ADMM directly applied to (2.12), there are variants similar to Algorithm 4 requiring the exchange of the coupling variables only [Boy+11]. This leads to a communication overhead of $\sum_{i \in \mathcal{R}} |\mathcal{C}(i)|$ floats per ADMM iteration. At first glance, it thus seems that ADMM outperforms the ALADIN variants by far in terms of communication. However, keep in mind that due to its weak convergence rate guarantees, ADMM might need many more iterations in total to achieve the same accuracy compared with ALADIN; for evaluating the *total amount of communication*, one has to multiply these numbers by the number of iterations needed. We will investigate this effect numerically in Section 5.6.
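To get a rough feel for these counts, the formulas above can be evaluated for a small example. The function names and the uniform dimensions per subsystem below are illustrative assumptions, not values from the thesis:

```python
def floats_basic_aladin(nx, ng, nh_active, n_sub):
    """Forward communication of basic ALADIN (floats per outer iteration):
    symmetric Hessian, active-constraint Jacobian, primal iterate, gradient."""
    per_sub = nx * (nx + 1) // 2 + nx * (ng + nh_active) + 2 * nx
    return n_sub * per_sub

def floats_condensed_aladin(n_coupling, n_sub):
    """Forward communication with condensing: symmetric Schur complement
    and right-hand-side vector of coupling dimension |C(i)| per subsystem."""
    per_sub = n_coupling * (n_coupling + 1) // 2 + n_coupling
    return n_sub * per_sub

def floats_bilevel(n_coupling, n_sub, n_inner, n_global_per_iter=0):
    """Forward communication of b-ADMM (n_global_per_iter = 0) or b-CG
    (n_global_per_iter = 2 floats per subsystem and inner iteration)."""
    return n_inner * n_sub * (n_coupling + n_global_per_iter)

# Illustrative sizes: 3 subsystems, 50 local variables, 40 equality
# constraints, 10 active inequalities, 6 coupling variables each.
print(floats_basic_aladin(50, 40, 10, 3))       # 11625 floats per iteration
print(floats_condensed_aladin(6, 3))            # 81 floats per iteration
print(floats_bilevel(6, 3, n_inner=20))         # 360 floats (20 inner sweeps)
```

Even for this tiny example, the dense Hessian and Jacobian blocks dominate the count for basic ALADIN, which is exactly the motivation for the condensed and bi-level variants.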


Table 4.3: Bi-level ALADIN compared to distributed optimization algorithms.

### **Convergence properties and rates**

As shown in Theorem 6, bi-level ALADIN has (local) convergence properties similar to basic ALADIN, with one difference: the convergence rate now additionally depends on the accuracy of the coordination step and on the quality of the Hessian approximations. If the accuracy increases and the approximation error decreases fast enough in terms of the distance to a local minimizer, the quadratic convergence rate is preserved.

The properties of bi-level ALADIN compared with primal, primal-dual and internal decomposition methods are summarized in Table 4.3. One can see that with bi-level ALADIN we overcome the two main limitations of basic ALADIN: the need for centrally solving a large coordination QP and the correspondingly large communication overhead. At the same time, the favorable convergence properties of ALADIN are to a large extent preserved.

It remains to show how these methods perform in practice. We will do so in the next chapter, with a main focus on problems from power systems. A further example from distributed optimal control is given in Section 6.6.

# **4.7 Summary and conclusion**

In this chapter we presented a framework called bi-level ALADIN, which is one of the first decentralized optimization algorithms for non-convex optimization with fast local convergence guarantees. In bi-level ALADIN, we use methods from nonlinear programming such as the nullspace method and the Schur complement to substantially reduce the dimension of the coordination QP of basic ALADIN. Moreover, we showed how to exploit problem structure to solve the coordination problem in a decentralized fashion, making bi-level ALADIN a decentralized algorithm. We presented two algorithms for doing so: a variant of ADMM derived similarly to other variants in the literature, and a decentralized variant of the conjugate gradient method, which is, to the best of our knowledge, novel. The versions of d-ADMM and d-CG given here improve upon [Eng+20]: ADMM is guaranteed to converge linearly due to a strongly convex formulation of the coordination QP, and the variant of the conjugate gradient method used here omits the precomputation step from [Eng+20], leading to a substantial reduction of local communication for d-CG. Additionally, we showed that the fast local convergence properties of ALADIN are preserved in bi-level ALADIN if the error in the coordination step decays fast enough.

# **5 Application to Power Systems**

This chapter evaluates the performance of ALADIN, bi-level ALADIN and ADMM on Optimal Power Flow (OPF) problems, an important problem class from power systems. We start with the main motivations for using distributed optimization methods for OPF.

We consider the AC power system model from [GS94; Sch17; FR16; FGM18]. The problem formulation for distributed OPF is from [Eng+19b; Eng+17]. The numerical results of basic ALADIN and ADMM are similar to [Eng+19b; Eng+20]. The analysis of the role of a feasible initial point in Section 5.4.3 is from [EF18]. The analysis of bi-level ALADIN for OPF in Section 5.5 is mainly from [Eng+20]. Parts of the convergence analysis of d-ADMM and the numerical communication analysis are new in this thesis.

## **5.1 How to operate power systems?**

The main purpose of power systems is to deliver power from generators to consumers. Thereby, the consumers' demand should be satisfied at any time and in arbitrary amounts. In a classical setting, there are no storage elements. This implies that the generated power has to match the power demand exactly at each point in time. Although power demands can typically not be influenced, there is a certain degree of freedom in power generation: one can decide which generator takes which share of the total active and reactive power demand. This can be exploited for cost-effective operation.

Engineering constraints also have to be considered. Transmission lines, for example, have thermal limits allowing them to carry only a limited current. Consumer devices and grid components are only guaranteed to work properly in specified ranges around the nominal voltage. Having this in mind, one can ask the following important question: how can the generator inputs be chosen such that the total generation cost is minimal while all power demands and technical limits are satisfied?


One of the most promising ways of approaching this question is to encode it in an optimization problem, the so-called Optimal Power Flow problem. Optimal power flow computes cost-optimal generator set points while satisfying power demands and technical limitations. We give a brief introduction to OPF in Appendix B.

### **New challenges in today's power system**

For formulating an OPF problem, a grid model is required. Grid models, however, contain sensitive data. This includes the model data itself but also consumption data potentially revealing sensitive information about consumer behavior. Moreover, power systems are often required to be $(n-1)$-secure, that is, single points of failure are not acceptable; centralized data aggregation and computation, however, introduce exactly such a single point of failure. Furthermore, in many countries there are many grid operators which are cooperatively responsible for operating the grid in an economically optimal and safe fashion. In Germany, for example, there are four transmission system operators and more than 800 distribution system operators [Bun19], which have to coordinate in some way. The map of the German distribution grid operators, Figure 5.1, graphically illustrates the dimension of this problem. This raises the question of how to coordinate these grid operators.

Distributed and especially decentralized optimization algorithms seem to be an excellent fit for addressing the above challenges. Distributed optimization offers systematic coordination while reducing complexity by shifting the computational burden to the subsystems (grid operators). At the same time, information exchange and centralized coordination are limited.

Figure 5.1: Distribution grid operators in Germany [ene20].

## **5.2 Distributed optimal power flow**

In order to apply distributed and decentralized optimization algorithms such as ADMM and (bi-level) ALADIN, we have to express the OPF problem mathematically in structured form (1.2),

$$\min\_{\mathbf{x}} \sum\_{i \in \mathcal{R}} f\_i(\mathbf{x}\_i) \quad \text{subject to} \quad \mathbf{x}\_i \in \mathcal{X}\_i, \quad i \in \mathcal{R} \quad \text{and} \quad \sum\_{i \in \mathcal{R}} A\_i \mathbf{x}\_i = b. \tag{5.1}$$

Here, the set $\mathcal{R}$ represents the set of regions or system operators, $f_i$ encodes the cost of power generation in each region, and $\mathcal{X}_i$ captures the grid model and technical limitations in each region.

Due to renewable generation, electrical grids today are often operated close to their capacity limits. Thus, one has to consider grid models capturing the effects which occur in this case, such as reactive power and voltage limits. The most common model considering these effects is the so-called AC model, which we recall in Appendix B. Therein, we also show how to formulate the AC model in the form of (5.1). Solving (5.1) via distributed or decentralized algorithms is called distributed optimal power flow.
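As a minimal sketch of the structured form (5.1), consider two regions sharing one boundary bus: each region holds a local copy of the shared variable, and one consensus row forces the copies to agree. All variable names and numbers below are illustrative, not taken from the thesis:

```python
import numpy as np

# Region 1 variables: [p_gen1, v_copy1]; region 2: [p_gen2, v_copy2],
# where v_copy1, v_copy2 are the local copies of the shared boundary voltage.
x1 = np.array([1.2, 1.01])
x2 = np.array([0.8, 1.01])

# One consensus row enforcing v_copy1 - v_copy2 = 0, i.e. A1 x1 + A2 x2 = b.
A1 = np.array([[0.0, 1.0]])
A2 = np.array([[0.0, -1.0]])
b = np.zeros(1)

residual = A1 @ x1 + A2 @ x2 - b
assert np.allclose(residual, 0.0)   # the two copies agree, consensus holds
```

Each region can evaluate its own cost and constraints using only its local $x_i$; the coupling between regions is confined entirely to the sparse consensus constraint.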

## **5.3 Literature review**

Next, we provide an overview of the main lines of research for distributed OPF. In Chapter 3 we have seen that most distributed and decentralized optimization algorithms are designed for *convex problems*. Moreover, even if they can handle non-convexity, this is typically non-convexity in the objective function but not in the constraints. Unfortunately, AC optimal power flow problems have *non-convex* constraints (cf. Appendix B). To cope with this issue, one can identify four main lines of research in the literature, which we discuss in the following subsections: internal decomposition methods, primal-dual algorithms applied directly to the non-convex problem [HPK02; SPG13], convex relaxations and approximations, and recent distributed algorithms designed for non-convex problems. Although the primal-dual methods work well in many cases, their convergence cannot be guaranteed and their convergence rate is often slow, as we will see later in this chapter.


### **Internal decomposition methods**

Let us start with internal decomposition methods, as they were among the first decomposition methods applied to power flow and also to optimal power flow problems. Note that, as outlined in Chapter 3, these methods were mainly designed for computational speedup in view of the limited computing power at that time. One of the first works [FP78] considers a combination of clustering techniques and block elimination in the KKT system. This line of research was extensively investigated in the 1980s, e.g. for power system state estimation [WHB81], even in asynchronous schemes [TPG83]. Advanced structure-exploiting factorization techniques are used in [VNM83; BA86; SAY87; ETA90; OKS90; SVN93]; most of these works are summarized in the technical report [Com92]. Some of these block-elimination techniques have a similar flavor to the reduction methods we used in Section 4.1 for bi-level ALADIN, especially the works [FP78; SAY87]. Although these methods are transferable to OPF, in their original form they typically address power flow or state-estimation problems. More recent examples of internal decomposition in interior point methods can be found in [Yan+11] and [LT18], where the authors of [LT18] report very promising numerical results also for large-scale systems. Note that all of the aforementioned methods still require central coordination for solving the reduced KKT system and also for other operations such as barrier-parameter updates. Thus, these methods offer only a limited degree of distribution.

A second branch works on decomposing the KKT conditions instead of the KKT system. The main idea is to hold all variables which are not involved in the corresponding subsystem fixed and to solve the resulting simplified subproblems independently of each other. A method doing so is Optimality Condition Decomposition (OCD), first proposed in [CNP02; Con+06] and further investigated for OPF in [NPC03; HA09]. In [CNP02], the convergence rate of OCD is shown to be linear under certain technical conditions. One weakness of the above methods is that, although they might work well for certain practical problems, their convergence seems not to be fully analyzed at present. In [CNP02], a sufficient condition for convergence of OCD to a first-order stationary point is developed. However, to the best of our knowledge, it remains unclear whether this condition holds for arbitrary OPF problems [Ers14a]. A similar approach based on decomposing the KKT conditions without tuning parameters can be found in [BB06].

*Remark 16 (Relation to classical Gauss-Seidel methods)* Note that also the classical Gauss-Seidel method for power flow problems [HO93; Glo11] can be considered a nodally "distributed" algorithm. Therein, each node computes individual voltage updates based on the current iterates of neighboring nodes. However, as the non-convexity of the power flow equations suggests, the classical literature also indicates that the Gauss-Seidel method often diverges [Glo11, Chap. 6.5].

### **Primal-dual algorithms**

We continue with (convex) primal-dual optimization algorithms based on augmented Lagrangians directly applied to the non-convex AC OPF. Early primal-dual methods use dual decomposition [AQC99], ADMM [WSL01], and related augmented Lagrangian methods such as the auxiliary problem principle (APP) [Bal+99; HPK02; AKA07] for solving the OPF problem in a distributed fashion. APP is evaluated against other augmented Lagrangian methods like ADMM and a predictor-corrector proximal multiplier method in [KB00], all of them showing comparable performance. This line of research continues until today, especially for ADMM [Ers14b], with recently refined parameter-update schemes in [Ers15]. Although ADMM converges to modest accuracy for medium-sized grids (similar to many other augmented Lagrangian methods), recent attempts on large-scale grids indicate that convergence becomes increasingly difficult with increasing problem size and might work only with special partitioning techniques [GHT17]. Moreover, all of these methods are in general not guaranteed to converge, and divergence seems to occur in practice [Chr+17a]. If they converge, they achieve at most a linear convergence rate, and the convergence modulus can be quite large, as we shall see later. Moreover, if an increasing sequence of the penalty parameter is used in combination with a feasible initialization (as done, for example, in [GHT17]), one can show that these methods converge to a feasible but not necessarily optimal point (cf. Section 5.4.3). One strength of augmented Lagrangian and similar methods is that they converge quite stably to modest accuracy for OPF, at least up to medium-sized grids. Recently, an augmented-Lagrangian-based scheme for radial grids with guarantees for the AC model was presented in [Liu+17].

#### **Convex relaxations and approximations**

A different approach uses convex *relaxations* and *approximations* of the power flow equations, yielding convex problems for which the previously presented algorithms are, in contrast to the full AC OPF problem, guaranteed to converge. A classical approach is based on the DC power flow approximation [GS94], to which augmented-Lagrangian-based approaches have been applied in [LHA94; CA98; BB03; Fei+15]. The advantage of this approach is that classical augmented Lagrangian methods converge with guarantees and with modest communication requirements for the relaxed problems. However, important physical quantities such as voltage magnitudes and reactive power are not considered in the DC model, making it inapplicable for many practical applications. Especially for distribution grids, which are one of the most important fields of application for modern OPF methods, the accuracy of the DC model can be poor [BZ16; Chr+17a]. Recent approaches combining the DC model with modern distributed algorithms such as ALADIN can be found in [Jia+20; Jia+19]. Another approach for convex modeling of radial grids is the so-called LinDistFlow model [BW89], which is based on the assumption of a flat voltage profile. The resulting convex OPF problem is solved via ADMM in [ŠBC14]. However, in view of the large voltage deviations caused by renewable power injection in distribution grids, this assumption hardly holds in many practical situations.


Table 5.1: Distributed optimization for optimal power flow—state-of-the-art.

Instead of the DC approximation, a related line of research employs convex *relaxations* of the power flow equations via Semi-Definite Programming (SDP). Therein, the idea is to lift the OPF problem to a higher-dimensional space in which it becomes convex if a certain rank constraint is dropped [Bai+08; Mol+13]. This relaxed and inflated problem is then solved by a distributed or decentralized convex optimization algorithm, obtaining convergence guarantees [DZG13b; PL17; Chr+17b]. The crux of this approach is that the solution of the relaxed problem might not satisfy the aforementioned rank condition, and thus the solution of the original OPF might not be recoverable from the relaxed problem. Under certain assumptions on the technical equipment, such as small transformer resistances, or on the topology, e.g. radial grids [LL12; Low14b; Low14a], this "back-mapping" is guaranteed to exist, although these assumptions pose severe limitations on the class of problems to which the approach is applicable. Moreover, there is an ongoing debate whether the assumptions made are realistic [Chr+17a], and one can show that recovering the solution from an SDP relaxation might fail even for very small systems [KDS16]. Furthermore, the number of decision variables is squared when using SDPs, leading to very large SDPs (especially for large systems), which can cause tractability issues [Cap16].

#### **Recent non-convex distributed algorithms**

Recently, new algorithms designed for non-convex distributed nonlinear programming have been proposed which are able to handle the AC model directly and with guarantees without relying on relaxations or approximations. One of them is based on an alternating trust-region approach using a decentralized variant of conjugate gradients and alternating projection methods with convergence guarantees at linear rate for linearly-constrained non-convex problems [HJ17]. The approach was successfully applied to small OPF problems ranging from 9 to 56 buses. A second approach is based on ALADIN which we will present in this chapter [Eng+19b].

An overview of the previously outlined lines of research with the main strengths and weaknesses of each line is given in Table 5.1. For more detailed overviews on general and distributed methods for OPF we refer to [Mol+17; Cap16]. Note that ADMM and ALADIN are attributed to "this thesis" in Table 5.1 in the sense that we compare the performance of these two algorithms in this work.


Table 5.2: Grid partitioning (excluding auxiliary buses).


*Remark 17 (Related OPF problems)* Considering problems with integer variables, such as certain versions of the reactive power dispatch problem, is beyond the scope of the present work. For first distributed approaches in this direction we refer to [Mur+18; Mur+19]. Moreover, note that although our approach is also applicable to *temporal* decomposition, we focus on *spatial* decomposition here. Examples of temporal decomposition in OPF with storage or generator ramp constraints can be found in [KFS18; AC00; XS01; CCD05]. Therein, similar to internal decomposition methods, the main goal is computational tractability.

Note that these problem formulations have strong interconnections to the theory of *economic model predictive control* [Fau+18] originally developed for the optimal control of time-invariant dynamical systems. First attempts connecting these two worlds and transferring latest methodological developments from this field to power systems are proposed in [FGM18; FE19; Köh+17].

# **5.4 ADMM for Optimal Power Flow**

Now, let us apply the distributed optimization algorithms from Section 2.2 to OPF. We start with ADMM for two reasons: first, as outlined in Section 5.3, ADMM is one of the most prominent methods for distributed optimization in general and for distributed OPF in particular [Ers14b; KB00; GHT17]. Second, we will use ADMM as a benchmark algorithm for evaluating the performance of ALADIN and the bi-level ALADIN scheme.

(a) IEEE 118-bus system.

(b) Modified 5-bus system from [LB10].

Figure 5.2: Numerical examples.

Throughout this chapter, we use a 5-bus system from [LB10] as a tutorial example and the IEEE 118-bus system as a more realistic test system. Both systems are shown in Figure 5.2; the corresponding partitionings are given in Table 5.2. In all cases, we tune ADMM for best performance. Note that the 5-bus system is particularly designed to be hard: we chose a partitioning in which a large amount of power has to be exchanged between region one (as mainly generating region) and the other two regions (as mainly demanding regions). Moreover, we put line limits particularly on lines connecting regions, which will lead to difficulties, especially for ADMM, as we shall see later. Results for ALADIN and ADMM on further benchmark systems with up to 300 buses can be found in [Eng+19b].

## **5.4.1 Typical numerical results**

Next, we show typical numerical results when applying ADMM to OPF problems. Figure 5.3 shows the convergence behavior of ADMM for the 5-bus example from Figure 5.2b for three different values of the tuning parameter $\rho^{\mathrm{ADM}} \in \{10^2, 10^3, 10^4\}$, neglecting line limits. The upper subfigures depict the values of all active and reactive power injections over the iteration index $k \in \mathbb{N}$.<sup>1</sup> To characterize convergence, we consider the following metrics: the consensus violation $\|A x^k - b\|_\infty$ at the current iterate $x^k$, characterizing the maximum mismatch of active/reactive power or voltage at coupling nodes; the distance to the minimizer $x^\star$, $\|x^k - x^\star\|_\infty$;<sup>2</sup> the distance of the current objective function value $f^k := f(x^k)$ to a local minimum $f^\star := f(x^\star)$; and the stationarity residual $\kappa^k := \nabla f(x^k) + \sum_{i \in \mathcal{R}} \gamma_i^{k\top} \nabla g_i(x_i^k) + \sum_{i \in \mathcal{R}} \mu_i^{k\top} \nabla h_i(x_i^k) + \lambda^{k\top} \nabla \bigl(\sum_{i \in \mathcal{R}} A_i x_i^k - b\bigr)$ for (2.12), measuring the degree of stationarity (dual feasibility) in the KKT conditions (2.2a). Note that we use the infinity norm $\| \cdot \|_\infty$ here for scale independence.

One can see that, with proper tuning of $\rho^{\mathrm{ADM}}$, ADMM is able to reach a medium accuracy of $10^{-2} \dots 10^{-3}$ within about 100 iterations. This is typical for ADMM and also observed in many other applications [Boy+11].

<sup>1</sup> We use the standard per-unit (p.u.) normalization system here, cf. [GS94] for details.

<sup>2</sup> We compute the reference solution ★ via the centralized interior-point solver IPOPT.

Figure 5.3: ADMM for the 5-bus system for $\rho^{\mathrm{ADM}} \in \{10^2, 10^3, 10^4\}$.

*Remark 18 (Multiple local minimizers)* Note that as OPF is non-convex, there are possibly multiple local minimizers [Buk+13; LW15]. However, practical experience shows that in most cases there is only one technically meaningful high-voltage solution, and local solvers converge to that solution if initialized properly [Cap16; Mom+97]. Here, all algorithms converged to the same local minimum for the standard initialization (every value zero except for the voltage magnitudes, which are initialized with 1; flat start). The theoretical foundations of existence and uniqueness of solutions to the power flow equations are the subject of ongoing and future work [CS19; Sim18; MH19].

#### **Scaling-up to 118 Buses**

Scaling up to large grids is often hard: numerical issues such as linearly dependent constraint linearizations are more likely to occur in higher dimensions, and finding an initial point close to a local minimizer might also be harder. Figure 5.4 shows the numerical performance of ADMM

Figure 5.4: ADMM for the IEEE 118-bus system with $\rho^{\mathrm{ADM}} \in \{10^2, 10^3, 10^4\}$ and $p_g$ at nodes $\in \{10, 25, 26, 65\}$.

for the IEEE 118-bus system from Figure 5.2a. One can see that ADMM needs about 150 iterations to converge to a modest accuracy of $10^{-2}$ in the distance to the local minimizer, roughly twice as many as for the 5-bus system.

## **5.4.2 The role of special constraints**

ADMM works quite well in many settings, such as for the 5-bus or 118-bus OPF problems. Next, however, we show a practical example where the number of iterations needed to reach the same level of accuracy grows by at least a factor of 10 compared with the aforementioned cases, solely by adding line limits to the OPF problem. Note that this modification changes neither the problem class (non-convex NLP) nor the problem size. The line limits are indicated as red bars in Figure 5.2.

Figure 5.5: ADMM for the 5-bus system for $\rho^{\mathrm{ADM}} \in \{10^2, 10^3, 10^4\}$ and active line limits.

The numerical results are shown in Figure 5.5. Here, ADMM needs about 1,500 iterations to reach the same level of accuracy of around $10^{-2}$.

### **5.4.3 The role of a feasible initialization**

Recent works applying ADMM to OPF problems suggest using increasing sequences for the penalty parameter $\rho^{\mathrm{ADM}}$ [Ers15]. In [GHT17], this rule is combined with a feasible initial point for large test systems. Next, we investigate the practical and theoretical implications of these modifications.

Figure 5.6 shows the convergence behavior of ADMM for a very large value of $\rho^{\mathrm{ADM}} = 10^9$, with and without a feasible initialization. One can see that with a feasible initialization, ADMM stays at the initial point, which is feasible, but the gap of the cost to the optimal one, $f^k - f^\star$, is large and does not decay to zero. This means that there is insufficient progress in terms of optimality. This can be explained by the role of the penalization parameter in

Figure 5.6: ADMM for the 5-bus system with $\rho^{\mathrm{ADM}} = 10^9$ and active line limits, initialized with (yellow) and without (red) a feasible initial point.

ADMM (Algorithm 4): in case $\rho^{\mathrm{ADM}}$ is large and the constraints $g$ and $h$ are satisfied, and as $A_i$ has full row rank, ADMM will return $x_i^{k+1} = z_i^k$ in step 1), forced by the quadratic penalization term $\frac{\rho^{\mathrm{ADM}}}{2}\|A_i(x_i - z_i^k)\|_2^2$. One can think of the following two steps 2) and 3) as a kind of averaging of the coupling variables, which will not change either if the local variables do not change. Hence, ADMM gets stuck at a feasible point in this case.
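This stalling behavior is easy to reproduce on a toy consensus problem. The sketch below uses a textbook scaled-form consensus ADMM with two scalar subsystems (not exactly Algorithm 4); all names and values are illustrative:

```python
# Toy consensus problem: min (x - 0)^2 + (x - 2)^2 over x1 = x2, optimum x* = 1.
def admm(rho, iters, z0=0.0):
    a, c = [1.0, 1.0], [0.0, 2.0]      # local costs f_i(x) = a_i (x - c_i)^2
    z, u = z0, [0.0, 0.0]              # feasible start: both copies equal z0
    for _ in range(iters):
        # local step: argmin_x a_i (x - c_i)^2 + (rho/2)(x - z + u_i)^2, closed form
        x = [(2*a[i]*c[i] + rho*(z - u[i])) / (2*a[i] + rho) for i in range(2)]
        z = sum(x[i] + u[i] for i in range(2)) / 2   # averaging (consensus) step
        u = [u[i] + x[i] - z for i in range(2)]      # scaled dual update
    return z

z_mod = admm(rho=10.0, iters=200)   # moderate penalty
z_big = admm(rho=1e9, iters=200)    # huge penalty, feasible start z0 = 0

assert abs(z_mod - 1.0) < 1e-6      # converges to the optimum x* = 1
assert abs(z_big) < 1e-4            # barely moves away from the feasible start
```

With a moderate penalty the iterates reach the optimum quickly, while the huge penalty pins the local solutions to the (feasible) averaging variable, mirroring the stalling effect observed in Figure 5.6.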

#### **Mathematical analysis**

Next, we analyze this effect mathematically for ADMM in the case $b = 0$, which holds for OPF (cf. Appendix B). Note that in the analysis we use $\rho$ instead of $\rho^{\mathrm{ADM}}$ for notational simplicity.

*Proposition 1 (Feasibility and* $\rho \to \infty$ *imply* $x^{\tilde k+1} - x^{\tilde k} \in \operatorname{null}(A)$*)* Consider the application of ADMM (Algorithm 4) to problem (2.12) with $b = 0$. Suppose that, for all $k \in \mathbb{N}$, the local problems in step 1) have unique minimizers fulfilling the LICQ. For $\tilde k \in \mathbb{N}$, let $\lambda^{\tilde k}$ be bounded and let $x^{\tilde k}$ be feasible, i.e. $x^{\tilde k} \in \prod_{i \in \mathcal{R}} \mathcal{X}_i \cap \mathcal{C}$. Then, for $\rho \to \infty$, the ADMM iterates satisfy $x^{\tilde k+1} - x^{\tilde k} \in \operatorname{null}(A)$.

*Proof.* The proof is divided into four steps. Steps 1)-3) establish technical properties used to derive the above assertion in Step 4).

Step 1). At iteration $\tilde k$, the local steps of ADMM are

$$\mathbf{x}_i^{\tilde k}(\rho) = \underset{\mathbf{x}_i \in \mathcal{X}_i}{\operatorname{argmin}} \; f_i(\mathbf{x}_i) + \left(\lambda_i^{\tilde k}\right)^{\top} A_i \mathbf{x}_i + \frac{\rho}{2} \left\| A_i \left(\mathbf{x}_i - \mathbf{z}_i^{\tilde k}\right) \right\|_2^2. \tag{5.2}$$

Now, by assumption, all $f_i$ are twice continuously differentiable (hence bounded on $\mathcal{X}_i$), $\lambda^{\tilde k}$ is bounded and all $x_i^{\tilde k} \in \mathcal{X}_i$. Thus, for all $i \in \mathcal{R}$, $\lim_{\rho \to \infty} x_i^{\tilde k}(\rho) = z_i^{\tilde k} + v_i^{\tilde k}$ with $v_i^{\tilde k} \in \operatorname{null}(A_i)$.

Step 2).The first-order stationarity condition of (5.2) can be written as

$$-\nabla f\_i(\mathbf{x}\_i^{\tilde{k}}) - \gamma\_i^{\tilde{k}\top} \nabla h\_i(\mathbf{x}\_i^{\tilde{k}}) = A\_i^{\top} \boldsymbol{\lambda}\_i^{\tilde{k}} + \rho \boldsymbol{A}\_i^{\top} \boldsymbol{A}\_i \left(\boldsymbol{x}\_i^{\tilde{k}} - \boldsymbol{z}\_i^{\tilde{k}}\right), \tag{5.3}$$

where $\gamma\_i^{\tilde{k}}$ is the multiplier associated with $h\_i$. Multiplying the multiplier update formula from step 3) with $A\_i^{\top}$ from the left, we obtain $A\_i^{\top} \lambda\_i^{k+1} = A\_i^{\top} \lambda\_i^{k} + \rho A\_i^{\top} A\_i (\mathbf{x}\_i^{k} - \mathbf{z}\_i^{k})$. Combined with (5.3), this yields $A\_i^{\top} \lambda\_i^{\tilde{k}+1} = -\nabla f\_i(\mathbf{x}\_i^{\tilde{k}}) - \gamma\_i^{\tilde{k}\top} \nabla h\_i(\mathbf{x}\_i^{\tilde{k}})$. By differentiability of $f\_i$ and $h\_i$ and regularity of $\mathbf{x}\_i^{\tilde{k}}$, this implies boundedness of $A\_i^{\top} \lambda\_i^{\tilde{k}+1}$.

Step 3). Next, we show by contradiction that $\Delta \mathbf{x}\_i^{\tilde{k}} \in \operatorname{null}(A\_i)$ for all $i \in \mathcal{R}$ as $\rho \to \infty$. By expressing $\mathbf{z}$ as $\mathbf{z} = \mathbf{x}^{\tilde{k}} + \Delta \mathbf{x}$ in step 2) of ADMM, the update reads

$$\min\_{\Delta \mathbf{x}} \sum\_{i \in \mathcal{R}} \frac{\rho}{2} \Delta \mathbf{x}\_i^\top A\_i^\top A\_i \Delta \mathbf{x}\_i - \lambda\_i^{\tilde{k}+1\top} A\_i \Delta \mathbf{x}\_i \quad \text{s.t.} \quad \sum\_{i \in \mathcal{R}} A\_i (\mathbf{x}\_i^{\tilde{k}} + \Delta \mathbf{x}\_i) = \mathbf{0}. \quad (5.4)$$

Observe that any $\Delta \mathbf{x} \in \operatorname{null}(A)$ is a feasible point of (5.4), as $\sum\_{i \in \mathcal{R}} A\_i \mathbf{x}\_i^{\tilde{k}} = 0$ by Step 1) of this proof and by feasibility of $\mathbf{z}^{\tilde{k}}$. Consider a feasible candidate solution $\Delta \bar{\mathbf{x}} \notin \operatorname{null}(A)$, for which $\sum\_{i \in \mathcal{R}} A\_i (\mathbf{x}\_i^{\tilde{k}} + \Delta \bar{\mathbf{x}}\_i) = 0$. By Step 2), the linear term $\lambda\_i^{\tilde{k}+1\top} A\_i \Delta \bar{\mathbf{x}}\_i$ is bounded, whereas the quadratic term in (5.4) grows linearly in $\rho$. Hence, for a sufficiently large value of $\rho$, the objective of (5.4) will be positive. However, for any $\Delta \mathbf{x} \in \operatorname{null}(A)$ the objective of (5.4) is zero, which contradicts optimality of the candidate solution $\Delta \bar{\mathbf{x}} \notin \operatorname{null}(A)$. Hence, choosing $\rho$ sufficiently large ensures that any minimizer of (5.4) lies in $\operatorname{null}(A)$.

Step 4). It remains to show that $A\_i \mathbf{z}\_i^{\tilde{k}+1} = A\_i \mathbf{z}\_i^{\tilde{k}}$. Recall that we used $\mathbf{z}^{\tilde{k}+1} = \mathbf{x}^{\tilde{k}} + \Delta \mathbf{x}^{\tilde{k}}$ in the previous step. Given Steps 1)-3), this yields $\mathbf{z}\_i^{\tilde{k}+1} = \mathbf{z}\_i^{\tilde{k}} + \mathbf{v}\_i^{\tilde{k}} + \Delta \mathbf{x}\_i^{\tilde{k}}$ and hence

$$\left\| A\_i \left( \mathbf{x}\_i - \mathbf{z}\_i^{\tilde{k}+1} \right) \right\|\_2^2 = \left\| A\_i \left( \mathbf{x}\_i - \mathbf{z}\_i^{\tilde{k}} - \mathbf{v}\_i^{\tilde{k}} - \Delta \mathbf{x}\_i^{\tilde{k}} \right) \right\|\_2^2 = \left\| A\_i \left( \mathbf{x}\_i - \mathbf{z}\_i^{\tilde{k}} \right) \right\|\_2^2,$$

since $\mathbf{v}\_i^{\tilde{k}}, \Delta \mathbf{x}\_i^{\tilde{k}} \in \operatorname{null}(A\_i)$.

Observe that this implies that, for $\rho \to \infty$, problem (5.2) does not change from step $\tilde{k}$ to $\tilde{k}+1$: by step 3) of Algorithm 4 we have $\lambda^{\tilde{k}+1} = \lambda^{\tilde{k}}$, since by Step 3) of this proof also $\mathbf{x}^{\tilde{k}+1} - \mathbf{x}^{\tilde{k}}$ lies in the nullspace of $A$. This proves the assertion. □

The above proposition shows that the local problems in step 1) of Algorithm 4 do not change over subsequent iterates once ADMM is at a feasible point and $\rho \to \infty$. Under the assumption that the local solvers are deterministic, i.e. they yield the same solution for the same problem data, this implies that ADMM stays at this feasible point, cf. Corollary 1 in [EF18].
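This stalling mechanism can be reproduced on a deliberately simple toy problem. The following Python sketch (illustration only, not one of the thesis experiments; the agents, objectives, and parameter values are made up and the local constraint sets $\mathcal{X}\_i$ are omitted) runs consensus ADMM for two scalar agents with $f\_1(x) = (x-1)^2$, $f\_2(x) = (x-2)^2$, and coupling $x\_1 - x\_2 = 0$, whose minimizer is $x^\star = 1.5$. With a consensus-feasible start $z\_1 = z\_2 = 0$ and a huge penalty, the iterates barely move:

```python
# toy consensus ADMM: f1(x) = (x-1)^2, f2(x) = (x-2)^2, coupling x1 - x2 = 0
def local_step(lam, z, a, rho, c):
    # closed form of argmin_x (x-c)^2 + lam*(a*x) + (rho/2)*(a*(x-z))^2
    return (2.0*c - a*lam + rho*a*a*z) / (2.0 + rho*a*a)

def admm(z0, rho, iters=50):
    z1 = z2 = z0            # consensus-feasible start (z1 == z2)
    l1 = l2 = 0.0
    for _ in range(iters):
        x1 = local_step(l1, z1, 1.0, rho, 1.0)    # agent 1, A1 = 1
        x2 = local_step(l2, z2, -1.0, rho, 2.0)   # agent 2, A2 = -1
        z1 = z2 = 0.5*(x1 + x2)                   # averaging (projection onto C)
        l1 += rho*1.0*(x1 - z1)                   # dual updates
        l2 += rho*(-1.0)*(x2 - z2)
    return x1, x2

print(admm(0.0, rho=1.0))   # converges towards the minimizer x* = 1.5
print(admm(0.0, rho=1e9))   # stuck near the feasible start: iterates stay ~0
```

For moderate penalties the same code converges to $1.5$; for $\rho = 10^9$ it remains near the (feasible but suboptimal) initialization although the consensus violation is tiny throughout.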

#### **Termination by consensus violation and alternating projections**

A second important insight is that a small consensus violation $\|A\mathbf{x}^k\|$ does not necessarily imply that the algorithm is close to a local minimizer. Otherwise, ADMM would terminate immediately in case of a feasible initial guess and a high penalization parameter. The consensus violation measures the feasibility of the current iterate, but the objective function value might be far from optimal. This is especially relevant here, as $\|A\mathbf{x}^k\|$ is sometimes used as a termination criterion for ADMM for OPF in the literature [Ers15; Ers14a]. In Appendix C we provide two additional analytical examples illustrating this behavior.

Table 5.3: ALADIN parameters

In the opposite case of an infeasible initial guess in combination with a very high penalization, ADMM starts projecting the iterates $\mathbf{x}$ and $\mathbf{z}$ back and forth between the consensus constraint set $\mathcal{C}$ and the local nonlinear constraint sets $\mathcal{X}\_i$ (power flow equations and bounds), cf. Algorithm 4. This can, for example, be seen from the large values of $\|A\mathbf{x}^k\|\_\infty$ in Figure 5.6. This behavior is similar to the examples for alternating projections from Section 2.3.1.
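The back-and-forth behavior can be sketched with plain alternating projections between two stand-in sets (chosen purely for illustration, in the spirit of Section 2.3.1: a nonconvex circle playing the role of $\mathcal{X}\_i$ and a line playing the role of $\mathcal{C}$):

```python
import numpy as np

def proj_circle(p):
    # projection onto the nonconvex set {x : ||x||_2 = 1}
    return p / np.linalg.norm(p)

def proj_line(p):
    # projection onto the consensus-like line {x : x_1 = x_2}
    m = 0.5 * (p[0] + p[1])
    return np.array([m, m])

p = np.array([2.0, 0.5])
for _ in range(50):
    p = proj_line(proj_circle(p))   # alternate between the two sets

print(p)  # settles at the intersection point (1/sqrt(2), 1/sqrt(2))
```

Here the projections settle quickly because the two sets intersect benignly; for the OPF sets, the analogous iteration can keep bouncing between the sets without converging.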

## **5.5 ALADIN for Optimal Power Flow**

Next, we investigate the convergence behavior of ALADIN (Algorithm 5) and bi-level ALADIN (Algorithm 8) for OPF. In all cases, we initialize ALADIN with a flat start<sup>3</sup>, use the regularization technique from [Eng20a], and use the ALADIN parameters given in Table 5.3. Moreover, we use diagonal scaling matrices $\Sigma\_i$ with entries of $10^2$ for voltage angles and magnitudes and entries of 1 for active and reactive power injections.

Figure 5.7 shows the convergence of ALADIN for the 5-bus system with and without line limits, and for the IEEE 118-bus system. In all cases, ALADIN converges quite fast, reaching very high accuracy in all indicators in fewer than 20 iterations, almost independently of the problem size and of the considered constraints (with/without line limits). Observe that here we achieve accuracies of about $10^{-10}$, whereas for ADMM we achieved $10^{-2}$ to $10^{-3}$. This indicates that ALADIN is less scale- and problem-dependent than ADMM—likely a consequence of its relationship to SQP methods and of the constraint information used in the coordination step. Moreover, note that in all cases we use a flat-start initialization—hence ALADIN does not rely on any sort of feasible initial guess. The large improvement in accuracy compared with ADMM can especially be observed in the convergence of the reactive power injections: whereas after 15 iterations they are more or less converged for ALADIN in the 5-bus system (Figure 5.7, red lines), for ADMM there still seem to be significant changes even after 1,500 iterations, regardless of the parameter $\rho^{\text{ADM}}$, cf. Figure 5.5.

<sup>3</sup> In the OPF context, a flat start is the initial guess where the voltage magnitudes are initialized with 1 p.u. and all other values with 0.

Figure 5.7: ALADIN for the 5-bus system without line limits (yellow), with line limits (red) and for the 118-bus system (blue) with flat start.

### **5.5.1 Bi-level distributed ALADIN**

Next, we investigate the convergence of bi-level ALADIN for the 118-bus OPF case. For b-CG we use a fixed number of 70 inner iterations; for b-ADMM we use 60, 100, and 200 inner iterations. Both variants require a minimum number of inner iterations (70 for b-CG and 60 for b-ADMM) to converge, i.e. to fulfill the accuracy requirement from inequality (4.33). We use $\rho^{\text{ADM}}\_{\text{inn}} = 10^{-2}$ as penalization parameter for the inner ADMM, which was the optimal value here. For both inner algorithms we use warm-starting. Pure ADMM uses $\rho^{\text{ADM}} = 10^{4}$, which was the value achieving the best overall performance.

Figure 5.8: ALADIN (yellow), bi-level ALADIN with CG (dashed blue), bi-level ALADIN with ADMM (dashed orange) and ADMM (purple) for the 118-bus system.

Figure 5.8 shows the convergence behavior of basic ALADIN, b-ADMM, b-CG, and pure ADMM. The graphs for ADMM and basic ALADIN are the same as before, shown again for the sake of comparison. One can see that b-CG reaches almost the same overall performance as basic ALADIN. This is interesting, as CG provides only limited accuracy in the inner step.

For b-ADMM, on the other hand, this is different: the achievable accuracy of the whole algorithm seems to depend on the number of inner ADMM iterations. This can be seen from the fact that $\|A\mathbf{x}^k\|\_\infty$ and $\|\mathbf{x}^k - \mathbf{x}^\star\|\_\infty$ stop decaying after around 10 outer iterations in Figure 5.8. Another difficulty when using ADMM as the inner algorithm is tuning. We investigate these two effects next.

Figure 5.9: Accuracy of inner ADMM after 100 iterations for different values of $\rho^{\text{ADM}}\_{\text{inn}}$ and after $k \in \{1, 4, 16\}$ outer iterations of bi-level ALADIN for 118-bus OPF.

## **5.5.2 Tuning inner ADMM**

ADMM as an inner algorithm requires a substantial amount of tuning. Figure 5.9 shows the accuracy in the inner problem (4.9) over the parameter $\rho^{\text{ADM}}\_{\text{inn}}$ for a fixed number of 100 inner d-ADMM iterations after $k \in \{1, 4, 16\}$ outer bi-level ALADIN iterations. One can see that, after one outer iteration, there are two distinct regions of near-optimal values for $\rho^{\text{ADM}}\_{\text{inn}}$, located at around $10^{-2}$ and at around $10^{0}$. This is one reason making tuning difficult.

Moreover, the optimal $\rho^{\text{ADM}}\_{\text{inn}}$ changes significantly while ALADIN proceeds: after four outer ALADIN iterations the optimal value is located at about $10^{0}$, and after 16 iterations at around $10^{2}$ to $10^{3}$. This is a second reason making tuning of the inner ADMM extremely difficult, since it means that ADMM has to be re-tuned after each outer ALADIN iteration.

Figure 5.10: Convergence of d-CG and d-ADMM for the 118-bus OPF problem for different values of $\rho^{\text{ADM}}\_{\text{inn}}$ after one outer ALADIN iteration (left) and after 16 outer iterations (right).

A similar conclusion can be drawn from Figure 5.10. This figure shows the residual of the coordination step (4.9) for d-ADMM (Algorithm 6) and for d-CG (Algorithm 7) in the first ALADIN iteration (left) and after 16 ALADIN iterations (right). Also from this figure one can conclude that the optimal parameter for inner ADMM changes by four orders of magnitude, from about $10^{-2}$ in the beginning to $10^{2}$. Moreover, one can observe another interesting effect, which is a possible explanation for the limited accuracy observed in Figure 5.8: whereas d-ADMM converges to a level of about $10^{-10}$ in early ALADIN iterations, in later iterations the achievable accuracy seems to be limited to about $10^{-3}$. Note that this can *not* be influenced by tuning of $\rho^{\text{ADM}}\_{\text{inn}}$: we tried different tuning parameters; $\rho^{\text{ADM}}\_{\text{inn}} = 10^{2}$ showed the best convergence, and all other values led to worse results. This effect seems not to have been investigated in the literature so far. In Appendix C we elaborate on it and present an example where it occurs already for very small problems.

For d-CG, in contrast, this effect does not occur, and d-CG converges to high accuracy in all cases. Moreover, d-CG is parameter-free and thus does not require any tuning.

*Remark 19 (Preconditioning for d-ADMM/d-CG)* By preconditioning of (4.9), one can significantly accelerate the convergence of d-ADMM and d-CG [GB17; Ste+20; Saa03]. However, note that preconditioning is in general also a centralized operation, and it has to be investigated whether the additional communication/coordination effort is outweighed by the reduced number of d-ADMM/d-CG iterations. In applications where the coefficient matrix of (4.9) does not change significantly from one step to the next, offline preconditioning seems promising (e.g. in a control setting [Sta+16]). However, as we have shown previously, the coefficient matrix in bi-level ALADIN changes significantly, and thus offline preconditioning seems less promising—in view of changes in the active set, but also in view of the fact that curvature information changes in each outer ALADIN iteration.

Table 5.4: Total forward communication until convergence (floats) for 118-bus OPF.

## **5.6 Comparing communication**

Next, we compare the forward communication from Section 4.6 for all algorithms numerically. Recall that in Section 4.6 we analyzed the communication effort *per iteration*. Here, we analyze the *total amount of communication*, i.e. the number of floats which have to be communicated in total until a pre-defined accuracy in the decision variables is reached. Hence, the "speed" of the algorithm plays a significant role, since the amount of per-step communication is multiplied by the total number of iterations. We compare the algorithms for a given accuracy in the primal variables, $\epsilon = \|\mathbf{x} - \mathbf{x}^\star\|\_\infty$.
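The accounting behind this comparison is simple multiplication; as a sketch (the numbers here are made up for illustration and are not the measured values):

```python
# total forward communication = floats sent per iteration x iterations
# needed to reach the accuracy eps = ||x - x*||_inf (illustrative numbers only)
def total_comm(floats_per_iter, iters_to_eps):
    return floats_per_iter * iters_to_eps

# a method sending few floats per step but converging slowly can still need
# more total communication than a "heavier" but fast-converging method:
slow_cheap = total_comm(floats_per_iter=50, iters_to_eps=1500)
fast_heavy = total_comm(floats_per_iter=900, iters_to_eps=20)
print(slow_cheap, fast_heavy)  # 75000 vs. 18000 floats
```

This is why a fast-converging method with heavier per-step communication can win in total, as the following tables illustrate for the real measurements.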

Table 5.4 and Table 5.5 show the total amount of forward communication for all investigated distributed optimization algorithms for the 118-bus OPF example (Table 5.4) and the 5-bus OPF example with line limits (Table 5.5). Hereby, we distinguish two values of the accuracy $\epsilon$ for both grids: a low-accuracy case and a high-accuracy case. As predicted in Section 4.6, the condensed ALADIN variant significantly reduces the amount of required communication. For the 118-bus system, the reduction is quite large with a factor of roughly 60; for the 5-bus system it is a bit smaller with a factor of 5. Note that the reduction mainly depends on the number of coupling variables in relation to the number of decision variables. For the 5-bus system this ratio is quite large, making the reduction in communication smaller.

Table 5.5: Total forward communication until convergence (floats) for 5-bus OPF with line limits.

Moreover, one can observe that, for the 118-bus system, bi-level ALADIN with CG requires roughly the same communication as the reduced-space variant—but now this is mainly *local* communication. For the 5-bus system, b-CG requires slightly more communication than condensed ALADIN. b-ADMM needs about a factor of 2-3 more local communication, mainly due to the slower convergence in the inner problem compared with b-CG.

Pure ADMM requires less communication than all other variants in the low-accuracy case ($\epsilon = 10^{-2}$) for the 118-bus system. However, as ADMM typically becomes quite slow in later iterations, the gap shrinks with higher accuracy requirements: for an accuracy of $\epsilon = 10^{-4}$, for example, the reduction factor of ADMM is only 2 instead of 5 for $\epsilon = 10^{-2}$.

For the 5-bus system with line limits (Table 5.5), this effect is extreme: as ADMM becomes very slow here (cf. Figure 5.5), it needs more communication than b-CG, and the achievable accuracy of pure ADMM within 2,000 iterations is no better than $10^{-2}$. Moreover, b-ADMM did not converge at all for this case, since the achievable accuracy in the inner problem was not high enough to reach an accuracy of $10^{-1}$.

# **5.7 Summary and conclusion**

The previous sections showed that ADMM can be applied successfully to OPF problems and converges quite robustly in many cases. However, ADMM has no general convergence guarantee, and divergence does occur in practice [Chr+17a]. Moreover, convergence is often possible only up to a limited accuracy, and ADMM might require a large number of iterations to achieve even that. Although this level of accuracy is often sufficient in an OPF context, as many parameters such as the power demands are anyway known only up to limited certainty [Cap16], one has to be aware that a consensus violation $\|A\mathbf{x}\| \neq 0$ always means that the solution is infeasible to some degree. This implies that active/reactive powers at interconnection points do not match perfectly; when applying ADMM to real systems, one has to account for this mismatch with underlying controllers compensating it. Furthermore, we showed that the performance of ADMM can vary greatly depending on the type of constraints considered (with or without line limits, for example).

We also showed that a feasible initialization in combination with high penalization parameters leads to feasible but not necessarily optimal solutions for ADMM. Moreover, we concluded that the consensus violation ($\|A\mathbf{x}\|\_\infty$) alone, as sometimes used in the literature [GHT17; Ers15], is in general insufficient as a termination criterion. Especially in combination with a feasible initialization, this can lead to arbitrarily early termination of ADMM: one can essentially enforce "convergence" in one step by choosing a sufficiently large $\rho$, although the current iterate may be far from a local minimizer. Moreover, the assumption of a feasible initial guess is strong: finding a feasible point has about the same complexity as the full OPF problem itself and again requires a strong central coordinator, jeopardizing the goals of distributed optimization.

### **Applying ALADIN variants**

As an alternative to ADMM, we proposed different ALADIN variants. ALADIN has the appealing property of theoretically very fast local convergence, which we also observed in practice. One drawback of ALADIN is its comparably large communication and coordination requirements. As an alternative, we proposed a condensed variant of ALADIN, substantially reducing coordination and communication requirements. Moreover, with bi-level ALADIN, decentralized computation is also possible, while preserving ALADIN's fast local convergence properties. We showed that the performance of bi-level ALADIN with conjugate gradients (b-CG) is almost indistinguishable from basic ALADIN's performance, although conjugate gradients introduce inexactness in the coordination step. For b-ADMM this is different: the achievable accuracy of b-ADMM seems to depend on the number of inner ADMM iterations. This stems from the fact that ADMM can solve the coordination problem only up to a certain accuracy, an effect which cannot be overcome by tuning. We also showed that this effect occurs already for very small-scale problems (cf. Appendix C). Moreover, tuning of the inner ADMM can be difficult. However, b-ADMM offers full decentralization in the sense that it does not require any form of global scalar sums, unlike b-CG. The convergence rate of all the aforementioned ALADIN variants seems to be independent of the problem size and of the considered constraints, which was not the case for ADMM. For large problems, tuning of all ALADIN variants and also of ADMM becomes increasingly difficult.

Table 5.6: ADMM, ALADIN and bi-level ALADIN for AC OPF.

The properties of ADMM, ALADIN, and bi-level ALADIN are summarized in Table 5.6.

#### **Choosing an algorithm**

Choosing an appropriate algorithm is difficult. Whereas ADMM works well in many cases, convergence might be very slow, and thus ADMM might take a long time to reach a solution. The ALADIN variants usually converge very fast, and decentralization is possible via bi-level ALADIN, although tuning might become difficult for larger grids. If high accuracy is required, ALADIN is certainly the first choice. Moreover, ALADIN is guaranteed to converge for an initial point close enough to a local minimizer, which ADMM is not. In terms of total communication, ADMM often has a slightly lower footprint.

# **6 The ALADIN-α toolbox**

In this chapter, we present an open-source MATLAB toolbox, ALADIN-α, implementing ALADIN, bi-level ALADIN, and ADMM with a unified interface.

Major parts of this chapter are from [Eng+20]. We limit ourselves to a brief overview here; for a more detailed description of options and subroutines, we refer to [Eng+20] and to the documentation website [Eng20b].

# **6.1 Features of ALADIN-α**

ALADIN-α is intended for rapid prototyping of distributed and decentralized optimization algorithms and aims at user-friendliness. The only user-provided information is objective and constraint functions—derivatives and numerical solvers are generated automatically by algorithmic differentiation routines and external state-of-the-art NLP solvers. A rich set of examples coming with ALADIN-α covers problems from robotics, power systems, sensor networks, and chemical engineering, underpinning its application potential.

ALADIN-α supports


The ADMM variant from Algorithm 4 is also included. ALADIN-α can be executed in parallel mode via the MATLAB Parallel Computing Toolbox. This can lead to a substantial speed-up, for example, in distributed estimation problems. However, although distributed optimization can be used for parallel computing, computation speed and real-time feasibility are currently *not* a primary focus of the toolbox. A documentation and many application examples of ALADIN-α are available under [Eng20b].

# **6.2 Literature review**

Despite the various practical applications of distributed optimization, only very few software toolboxes are freely available. The rare examples are typically tailored to a specific application at hand and typically address *convex* problems. Examples are several versions of ADMM applied to a plethora of applications summarized in [Boy+11], with code available under [Boy+20]. An implementation of ADMM for consensus problems can be found under [Mot+20]. A tailored implementation of ADMM for OPF problems using an algorithm from [GHT17] can be found under [Guo20]. However, there is a lack of multi-purpose software tools for distributed optimization: we were not able to find any open-source implementation for generic distributed non-convex optimization problems, nor, with respect to decentralized non-convex optimization, any publicly available code.

However, for parallel optimization, efficient structure-exploiting tools exist. A closed-source parallel interior-point software is OOPS [GG07]. The open-source package qpDUNES is tailored towards parallel solutions of QPs arising in model predictive control [FSD15]. For general QPs, the partially parallelizable solver OSQP seems promising [Ste+20]. PIPS is a collection of algorithms solving structured linear programs, QPs, and general NLPs in parallel [CPZ14; Lub+11]. The software HiOp is tailored towards structured and very large-scale NLPs with few nonlinear constraints, based on interior-point methods [Pet19; PCA19]. Moreover, combining parallel linear algebra routines (e.g. PARDISO [Sch+01]) with standard nonlinear programming solvers (e.g. IPOPT [WB06]) also leads to partially parallel algorithms [Cur+12; KFS18]. All these tools are implemented in low-level languages such as C or C++, leading to high computational performance. However, their focus is mainly on computational speedup via parallel computing rather than distributed and decentralized optimization in a multi-agent setting.

## **6.3 Parametric problem setup**

ALADIN-α solves structured optimization problems of the form

$$\min\_{\mathbf{x}\_1,\dots,\mathbf{x}\_{n\_{\mathcal{R}}}} \quad \sum\_{i \in \mathcal{R}} f\_i(\mathbf{x}\_i, p\_i) \tag{6.1a}$$

$$\text{subject to} \quad g\_i(\mathbf{x}\_i, p\_i) = 0 \qquad \mid \; \gamma\_i, \qquad \forall i \in \mathcal{R}, \tag{6.1b}$$

$$h\_i(\mathbf{x}\_i, p\_i) \le 0 \qquad \mid \mathbf{y}\_i, \qquad \forall i \in \mathcal{R}, \tag{6.1c}$$

$$\underline{\mathbf{x}}\_{i} \le \mathbf{x}\_{i} \le \overline{\mathbf{x}}\_{i} \qquad \qquad \mid \; \eta\_{i}, \qquad \forall i \in \mathcal{R}, \tag{6.1d}$$

$$\sum\_{i \in \mathcal{R}} A\_{i} \mathbf{x}\_{i} = b \qquad \qquad \mid \; \lambda, \tag{6.1e}$$

which is a slight modification of (2.12), introducing box constraints separately.

We do so for numerical efficiency reasons—some local solvers are able to handle box constraints by specialized routines. Moreover, problem (6.1) allows for parameter vectors $p\_i$. This can be useful, for example, in model predictive control, or if one would like to solve the same distributed problem multiple times for a variety of parameters.

## **6.4 Software structure**

## **6.4.1 Code structure**

In order to avoid side-effects and to make code modification easy for beginners, we chose a procedural/functional programming style. We implemented all core features in MATLAB to enable easy rapid prototyping. The overall structure of run\_ALADIN()—the main function of ALADIN-α—is shown in Figure 6.1. First, a preprocessing step performs a consistency check of the input data and provides default options. The createLocSolAndSens() function initializes the local parameterized NLPs and sensitivities for all sub-problems $i \in \mathcal{R}$. For constructing the local NLPs and sensitivities, we use CasADi due to its algorithmic differentiation features and the possibility to interface many state-of-the-art NLP solvers such as IPOPT [WB06; And+19]. CasADi itself relies on pre-compiled code, making function and derivative evaluation fast. A reuse option allows reusing the CasADi problem setup: when the reuse mode is activated (e.g. when ALADIN-α is used within an MPC loop), createLocSolAndSens() is skipped, resulting in a significant speed-up for larger problems.

Figure 6.1: Structure of run\_ALADIN() in ALADIN-α.

In the ALADIN main loop iterateAL(), the function parallelStep() solves the local NLPs and evaluates the Hessian of the Lagrangian (or its approximation, e.g. when BFGS is used), the gradient of the objective, and the Jacobian of the active constraints (sensitivities) at the NLP's solution. Regularization is done in parallelStep() if needed. Moreover, in case the nullspace method or bi-level ALADIN is used, the computation of the nullspaces and the Schur complements is done locally, shifting substantial computational burden from the centralized coordination step to parallelStep(). The function updateParam() computes dynamically changing ALADIN parameters for numerical stability and speedup.

Figure 6.2: The sProb data structure for defining problems in form of (6.1).

The coordination QP is constructed in the function createCoordQP(). We use problem (2.25), including slack variables for numerical stability. Different dense and sparse solvers for the coordination QP are available in solveQP(). Most of them are based on solving the first-order necessary conditions, which form a linear system of equations. Available solvers are the MATLAB linear-algebra routines linsolve() and pinv(), the backslash operator, and MA57 based on the MATLAB LDL-decomposition routine. Using sparse solvers can speed up the computation substantially; note that only MA57 and the backslash operator support sparse matrices. The solver can be specified by setting the solveQP option. In case of convergence problems from remote starting points, it can help to reduce the primal-dual stepsize of the QP step by setting stepSize in the options to a value smaller than 1.
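The linear-system view can be sketched on a generic equality-constrained QP (a Python/NumPy illustration with made-up matrices; this is not the actual coordination QP (2.25)):

```python
import numpy as np

# equality-constrained QP: min 1/2 x'Hx + g'x  s.t.  Ax = b.
# Its first-order necessary (KKT) conditions form one symmetric linear system:
#   [H  A'] [ x ]   [-g]
#   [A  0 ] [lam] = [ b]
H = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, -2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

n, m = H.shape[0], A.shape[0]
KKT = np.block([[H, A.T], [A, np.zeros((m, m))]])
sol = np.linalg.solve(KKT, np.concatenate([-g, b]))  # cf. backslash / MA57
x, lam = sol[:n], sol[n:]

print(x, lam)  # primal solution and coupling multiplier
```

Sparse symmetric-indefinite factorizations (such as the LDL decomposition underlying MA57) exploit the structure of this system, which is why sparse solvers pay off for larger coordination QPs.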

## **6.4.2 Data structures**

The main data structure for defining problems in form of (6.1) is a struct called sProb (cf. Figure 6.2). In this data structure, the objective functions $\{f\_i\}\_{i \in \mathcal{R}}$ and the constraint functions $\{g\_i\}\_{i \in \mathcal{R}}$ and $\{h\_i\}\_{i \in \mathcal{R}}$ are collected in cells contained in a struct called locFuns. Furthermore, sProb collects the lower/upper bounds (6.1d) in cells called llbx and uubx. The coupling matrices $\{A\_i\}\_{i \in \mathcal{R}}$ are collected in AA. Optionally, one can provide NLP solvers and sensitivities; in this case, the problem setup in createLocSolAndSens() is skipped, leading to a substantial runtime speedup for larger problems. Optionally, one can specify initial guesses in zz0 and initial Lagrange multipliers in lam0. The second ingredient for ALADIN-α is an opts struct, where one can specify the variant of ALADIN-α and algorithmic parameters. A full list of options with descriptions can be found under [Eng20b].

ALADIN-α returns a struct as output. It contains a cell xxOpt of locally optimal primal solutions $\{\mathbf{x}\_i^\star\}\_{i \in \mathcal{R}}$, and lamOpt contains the optimal Lagrange multipliers $\lambda^\star$ for the consensus constraints (6.1e). Moreover, the field iter contains information about the ALADIN iterates, such as primal/dual iterates, and timers contains timing information. Note that run\_ALADIN() and run\_ADMM() have the same function signature in terms of sProb—only the options differ.

## **6.5 A tutorial example**

Next, we provide two numerical examples illustrating how to use ALADIN-α in practice. The first one is a minimalistic example showing how to formulate problems in form of (6.1) and how to solve them with ALADIN-α.

First, we investigate how to reformulate a tutorial optimization problem into the partially separable form (6.1). Let us consider the non-convex NLP

$$\min\_{\mathbf{x}\_1, \mathbf{x}\_2 \in \mathbb{R}} f(\mathbf{x}) = 2 \left(\mathbf{x}\_1 - 1\right)^2 + (\mathbf{x}\_2 - 2)^2 \tag{6.2a}$$

$$\text{subject to} \quad -1 \le \mathbf{x}\_1 \cdot \mathbf{x}\_2 \le 1.5. \tag{6.2b}$$

In order to apply ALADIN-α, problem (6.2) needs to be in form of (6.1). To get there, let us introduce auxiliary variables $\mathbf{y}\_1, \mathbf{y}\_2$ with $\mathbf{y}\_1 \in \mathbb{R}$ and $\mathbf{y}\_2 = (\mathbf{y}\_{21}\ \mathbf{y}\_{22})^{\top} \in \mathbb{R}^2$. Let us couple these variables again by introducing a consensus constraint $\sum\_i A\_i \mathbf{y}\_i = 0$ with $A\_1 = 1$ and $A\_2 = (-1\ \ 0)$. Furthermore, let us reformulate the objective function by local objective functions $f\_1(\mathbf{y}\_1) := 2(\mathbf{y}\_1 - 1)^2$ and $f\_2(\mathbf{y}\_2) := (\mathbf{y}\_{22} - 2)^2$ with $f = f\_1 + f\_2$. Moreover, we reformulate the global inequality constraint (6.2b) by a local two-dimensional constraint function $h\_2 = (h\_{21}\ h\_{22})^{\top}$ with $h\_{21}(\mathbf{y}\_2) = -1 - \mathbf{y}\_{21}\mathbf{y}\_{22}$ and $h\_{22}(\mathbf{y}\_2) = -1.5 + \mathbf{y}\_{21}\mathbf{y}\_{22}$. Combining these reformulations yields

$$\min\_{\mathbf{y}\_1 \in \mathbb{R}, \mathbf{y}\_2 \in \mathbb{R}^2} 2 \left( \mathbf{y}\_1 - 1 \right)^2 + (\mathbf{y}\_{22} - 2)^2 \tag{6.3a}$$

$$\text{subject to}\quad -1 - \mathbf{y}\_{21}\mathbf{y}\_{22} \le \mathbf{0}, \quad -1.5 + \mathbf{y}\_{21}\mathbf{y}\_{22} \le \mathbf{0},\tag{6.3b}$$

$$\begin{array}{ll} \text{y}\_1 & + \ (-1 \ 0) \text{ y}\_2 = 0, \\ \end{array} \tag{6.3c}$$

which is in form of problem (6.1). Note that the solutions to (6.2) and (6.3) coincide, but (6.3) is of higher dimension; one can thus view the reformulation as a kind of lifting to a space of higher dimensionality. Moreover, observe that this reformulation contains a general strategy for reformulating problems into form (6.1): if there is nonlinear coupling in the objective functions or the constraints, introducing auxiliary variables and enforcing them to coincide

```
% a) via the MATLAB symbolic toolbox
% define symbolic variables
y1 = sym('y1',[1,1],'real');
y2 = sym('y2',[2,1],'real');
% define symbolic objectives
f1s = 2*(y1-1)^2;
f2s = (y2(2)-2)^2;
% define symbolic ineq. constraints
h2s = [ -1-y2(1)*y2(2); ...
        -1.5+y2(1)*y2(2)];
% convert symbolic variables to
% MATLAB functions
f1 = matlabFunction(f1s,'Vars',{y1});
f2 = matlabFunction(f2s,'Vars',{y2});
h1 = @(y1)[];
h2 = matlabFunction(h2s,'Vars',{y2});

% b) via the CasADi symbolic framework
% define symbolic variables
y_1 = SX.sym('y_1', 1);
y_2 = SX.sym('y_2', 2);
% define symbolic objectives
f1s = 2 * (y_1 - 1)^2;
f2s = (y_2(2) - 2)^2;
% define symbolic ineq. constraints
h1s = [];
h2s = [ -1 - y_2(1)*y_2(2); ...
        -1.5 + y_2(1)*y_2(2)];
% convert symbolic variables to
% MATLAB functions
f1 = Function('f1', {y_1}, {f1s});
f2 = Function('f2', {y_2}, {f2s});
h1 = Function('h1', {y_1}, {h1s});
h2 = Function('h2', {y_2}, {h2s});
```

```
% c) directly via function handles
% define objectives
f1 = @(y1) 2 * (y1 - 1)^2;
f2 = @(y2) (y2(2) - 2)^2;
% define inequality constraints
h1 = @(y1) [];
h2 = @(y2) [ -1 - y2(1) * y2(2);...
             -1.5 + y2(1) * y2(2)];
```
Figure 6.3: Tutorial example with three different ways of problem setup.

by an additional consensus constraint in form of (6.1e) yields purely affine coupling. With this strategy, one can reformulate most nonlinear programs into form (6.1).
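That the solutions of (6.2) and (6.3) coincide can also be checked numerically. The following Python sketch (for illustration only; it uses SciPy's generic SLSQP solver rather than ALADIN-α) solves both formulations and compares the results:

```python
from scipy.optimize import minimize

# original NLP (6.2): min 2(x1-1)^2 + (x2-2)^2  s.t.  -1 <= x1*x2 <= 1.5
f = lambda x: 2*(x[0] - 1)**2 + (x[1] - 2)**2
cons = [{'type': 'ineq', 'fun': lambda x: x[0]*x[1] + 1.0},   # -1 <= x1*x2
        {'type': 'ineq', 'fun': lambda x: 1.5 - x[0]*x[1]}]   # x1*x2 <= 1.5
sol = minimize(f, x0=[1.0, 1.0], constraints=cons, method='SLSQP')

# lifted formulation (6.3) with y = (y1, y21, y22)
fl = lambda y: 2*(y[0] - 1)**2 + (y[2] - 2)**2
consl = [{'type': 'ineq', 'fun': lambda y: y[1]*y[2] + 1.0},  # h21 <= 0
         {'type': 'ineq', 'fun': lambda y: 1.5 - y[1]*y[2]},  # h22 <= 0
         {'type': 'eq',   'fun': lambda y: y[0] - y[1]}]      # consensus (6.3c)
soll = minimize(fl, x0=[1.0, 1.0, 1.0], constraints=consl, method='SLSQP')

print(sol.x, sol.fun)    # minimizer and optimal value of (6.2)
print(soll.x, soll.fun)  # (y21, y22) reproduces sol.x; same optimal value
```

Both solves return the same optimal value, and the consensus component $\mathbf{y}\_1$ agrees with $\mathbf{y}\_{21}$ up to solver tolerance, illustrating the lifting argument above.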

**Solution with ALADIN-α**

```
% define coupling matrices
A1 = 1;
A2 = [-1, 0];
% collect problem data in sProb struct
sProb.locFuns.ffi = {f1, f2};
sProb.locFuns.hhi = {h1, h2};
sProb.AA = {A1, A2};
% start solver with default opts
sol = run_ALADIN(sProb);
```

```
================================================
==        This is ALADIN-alpha v0.1           ==
================================================
QP solver: MA57
Local solver: ipopt
Inner algorithm: none
No termination criterion was specified.
Consensus violation: 6.6531e-12
Maximum number of iterations reached.
------------- ALADIN-alpha timing -------------
                 t[s]   %tot  %iter
Tot time......:  3.92
Prob setup....:  0.19    4.8
Iter time.....:  3.72   95
---------
NLP time......:  1.1          29.7
QP time.......:  0.11          2.8
Reg time......:  0.02          0.6
Plot time.....:  2.27         60.8
================================================
```
Figure 6.4: Collection of variables and output of ALADIN-.

Figure 6.5: ALADIN-α iteration plot for tutorial problem (6.3).

To solve (6.3) with ALADIN-α, we set up the problem in a struct sProb as described in Section 6.4.2. To illustrate the different possibilities of problem setup for ALADIN-α, we construct the objective and constraint functions in three different ways: a) via the MATLAB symbolic toolbox, b) via the CasADi symbolic framework, and c) directly via function handles. All three ways are shown in Figure 6.3.

After defining the objective and constraint functions, all function handles and the coupling matrices are collected in the struct sProb (Figure 6.4). We call the run\_ALADIN() function with an empty options struct, leading to computation with default parameters. These steps and the resulting ALADIN-α report after running run\_ALADIN() are shown in the right pane of Figure 6.4. The report displays the reason for termination and the timing of the main ALADIN-α steps. Note that plotting takes a substantial amount of time, so it is advisable to deactivate online plotting if it is not needed for diagnostic reasons. Figure 6.5 shows the figures plotted while ALADIN-α is running. They show (in this order) the consensus violation, the local step sizes, and the step size in the coordination step (all measured in the $\infty$-norm), as well as the changes in the active set. From these figures one can usually recognize divergence quite fast, and one gets a feeling for the effectiveness of, e.g., new internal heuristics, or for the degree of accuracy reached after a certain number of iterations.

## **6.6 Applications beyond optimal power flow**

ALADIN-α comes with a rich set of examples.


Table 6.1: Application examples of ALADIN-α.


Furthermore, we included several test problems from the Hock-Schittkowski test collection [HS80]. The code for all these examples is available in the examples\ folder of ALADIN-α. Moreover, we provide textual descriptions of these examples in the online documentation of ALADIN-α [Eng20b]. The application examples are summarized in Table 6.1.

## **6.7 Distributed control of a chemical reactor**

Next, we show how to use ALADIN-α for distributed optimal control of a chemical reactor. The resulting OCP can serve as a basis for distributed model predictive control [RMD17; SWR11; MA17]. The chemical process we consider here consists of two continuous stirred-tank reactors (CSTRs) and a flash separator, as shown in Figure 6.6 [Cai+14; CLM11]. The goal is to steer the reactor to the optimal setpoint

$$
\mathbf{u}\_s^\top = (0 \;\; 0 \;\; 0)
$$

Figure 6.6: Reactor-separator process.

and

$$\mathbf{x}\_s^\top = (369.53 \;\; 3.31 \;\; 0.17 \;\; 0.04 \;\; 435.25 \;\; 2.75 \;\; 0.45 \;\; 0.11 \;\; 435.25 \;\; 2.88 \;\; 0.50 \;\; 0.12)$$

from an initial point

$$\mathbf{x}(0)^\top = (360.69 \;\; 3.19 \;\; 0.15 \;\; 0.03 \;\; 430.91 \;\; 2.76 \;\; 0.34 \;\; 0.08 \;\; 430.42 \;\; 2.79 \;\; 0.38 \;\; 0.08)$$

After applying a fourth-order Runge-Kutta scheme for discretization, the dynamics of the CSTRs and the flash separator are given by

$$
\mathbf{x}\_{i}^{k+1} = q\_{i}(\mathbf{x}\_{i}^{k}, \mathbf{u}\_{i}^{k}, \mathbf{z}\_{i}^{k}) \quad \text{for} \quad i \in \mathcal{R},
$$

where $q\_i : \mathbb{R}^{n\_{x\_i}} \times \mathbb{R}^{n\_{u\_i}} \times \mathbb{R}^{n\_{z\_i}} \to \mathbb{R}^{n\_{x\_i}}$ are the dynamics of the $i$th vessel and where $\mathcal{R} := \{1, 2, 3\}$ is the set of vessels. Here, $\mathbf{x}\_i^\top = (T\_i \; x\_{A,i} \; x\_{B,i} \; x\_{C,i})$ are the states, with $x\_{A,i}, x\_{B,i}, x\_{C,i}$ (10<sup>3</sup> mol/m<sup>3</sup>) being the concentrations of the reactants $A$, $B$, and $C$, and $T\_i$ (K) being the temperature. The inputs $u\_i = Q\_i$ (10<sup>3</sup> J/h) denote the heat influxes of the individual vessels, and $\mathbf{z}\_i$ are copied states of the neighbored reactors influencing reactor $i$, i.e. $\mathbf{z}\_i := (\mathbf{x}\_j)\_{j \in \mathcal{N}(i)}$. Note that the feed-stream flow rates $F\_{10}$, $F\_{20}$, $F\_3$, and $F\_r$ are fixed and given. Detailed equations for the dynamics of the CSTRs/separator are given in [CLM11].
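The fourth-order Runge-Kutta discretization mentioned above can be sketched generically as follows (a minimal Python sketch of ours, not the toolbox code; the dynamics `f`, input `u`, copied states `z`, and step size `dt` are placeholders):

```python
def rk4_step(f, x, u, z, dt):
    """One explicit fourth-order Runge-Kutta step for x' = f(x, u, z),
    i.e. the discrete-time map x_next = q(x, u, z) with sampling time dt."""
    k1 = f(x, u, z)
    k2 = f([xi + 0.5 * dt * ki for xi, ki in zip(x, k1)], u, z)
    k3 = f([xi + 0.5 * dt * ki for xi, ki in zip(x, k2)], u, z)
    k4 = f([xi + dt * ki for xi, ki in zip(x, k3)], u, z)
    return [xi + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]
```

For a simple linear decay test case, one RK4 step reproduces the exact flow up to the scheme's fifth-order local error.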

#### **The optimal control problem**

With the above, we are ready to formulate a discrete-time optimal control problem

$$
\begin{aligned}
\min\_{\substack{(\mathbf{x}\_i^n,\, \mathbf{u}\_i^n,\, \mathbf{z}\_i^n) \\ \forall n \in \mathbb{I}\_{[1,N]},\, \forall i \in \mathcal{R}}} \;
& \sum\_{i \in \mathcal{R}} \sum\_{n \in \mathbb{I}\_{[1,N]}} \tfrac{1}{2} (\mathbf{x}\_i^n - \mathbf{x}\_{s,i})^\top Q\_i (\mathbf{x}\_i^n - \mathbf{x}\_{s,i}) + \tfrac{1}{2} (\mathbf{u}\_i^n - \mathbf{u}\_{s,i})^\top R\_i (\mathbf{u}\_i^n - \mathbf{u}\_{s,i}) \\
\text{s.t.} \quad
& \mathbf{x}\_i^{n+1} - q\_i(\mathbf{x}\_i^n, \mathbf{u}\_i^n, \mathbf{z}\_i^n) = \mathbf{0}, \quad \mathbf{x}\_i^0 = \mathbf{x}\_i(0), && \forall n \in \mathbb{I}\_{[1,N]}, \; \forall i \in \mathcal{R}, \\
& \underline{\mathbf{u}} \le \mathbf{u}\_i^n \le \bar{\mathbf{u}}, \quad \underline{\mathbf{x}} \le \mathbf{x}\_i^n, && \forall n \in \mathbb{I}\_{[1,N]}, \; \forall i \in \mathcal{R}, \\
& \sum\_{i \in \mathcal{R}} E\_i \left( \mathbf{x}\_i^{n\top} \;\; \mathbf{u}\_i^{n\top} \;\; \mathbf{z}\_i^{n\top} \right)^\top = \mathbf{0}, && \forall n \in \mathbb{I}\_{[1,N]},
\end{aligned}
\tag{6.4}
$$

with lower/upper bounds on the inputs $\bar{\mathbf{u}} = -\underline{\mathbf{u}} = (5 \cdot 10^4 \;\; 1.5 \cdot 10^5 \;\; 2 \cdot 10^5)^\top$ and lower bounds on the states $\underline{\mathbf{x}} = \mathbf{0}$ for all times $n \in \mathbb{I}\_{[1,N]}$ and all vessels $i \in \mathcal{R}$. The weighting matrices are chosen as $Q\_i = \text{diag}(20 \;\; 10^3 \;\; 10^3 \;\; 10^3)$ and $R\_i = 10^{-10}$. The matrices $E\_i$ are selected such that they represent the constraint $\mathbf{z}\_i := (\mathbf{x}\_j)\_{j \in \mathcal{N}(i)}$. The sampling time is $\Delta t = 0.01$ h and the horizon is $T = 10$ h. By defining $\tilde{\mathbf{x}}\_i^\top := \big( \mathbf{x}\_i^{n\top} \; \mathbf{u}\_i^{n\top} \; \mathbf{z}\_i^{n\top} \big)\_{n \in \mathbb{I}\_{[1,N]}}$, $f\_i(\tilde{\mathbf{x}}\_i) := \sum\_{n \in \mathbb{I}\_{[1,N]}} \tfrac{1}{2} (\mathbf{x}\_i^n - \mathbf{x}\_{s,i})^\top Q\_i (\mathbf{x}\_i^n - \mathbf{x}\_{s,i}) + \sum\_{n \in \mathbb{I}\_{[1,N-1]}} \tfrac{1}{2} (\mathbf{u}\_i^n - \mathbf{u}\_{s,i})^\top R\_i (\mathbf{u}\_i^n - \mathbf{u}\_{s,i})$, $g\_i(\tilde{\mathbf{x}}\_i) := \big( \mathbf{x}\_i^{n+1} - q\_i(\mathbf{x}\_i^n, \mathbf{u}\_i^n, \mathbf{z}\_i^n) \big)\_{n \in \mathbb{I}\_{[1,N-1]}}$, and $h\_i(\tilde{\mathbf{x}}\_i) := \big( (\underline{\mathbf{x}} - \mathbf{x}\_i^n)^\top \; (\mathbf{u}\_i^n - \bar{\mathbf{u}})^\top \; (\underline{\mathbf{u}} - \mathbf{u}\_i^n)^\top \big)^\top\_{n \in \mathbb{I}\_{[1,N]}}$, one can see that the OCP (6.4) is in the form of (6.1) and thus solvable by ALADIN-α ($\tilde{\mathbf{x}}\_i$ here corresponds to $x\_i$ in (6.1)).

#### **Numerical results**

Figure 6.7 shows the resulting input and state trajectories for one OCP (6.4) for basic ALADIN and ADMM after 20 iterations, and for ADMM after 100 iterations. At first glance, all trajectories are quite close to each other. However, small differences in the input trajectories can be observed. Figure 6.9 shows the convergence indicators from Chapter 5 over the iteration index $k$. In logarithmic scale, these differences can be quite large. For example, the

Figure 6.7: Optimal input trajectories computed by ALADIN & ADMM.

Figure 6.8: Optimal state trajectories computed by ALADIN & ADMM.

Figure 6.9: Convergence of ALADIN and ADMM for OCP (6.4).

consensus gap $\|A x^k - b\|\_\infty$ is of the order of $10^1$ after 20 iterations, which means that the physical values at the interconnection points have a maximum mismatch of the order of $10^1$. ALADIN converges quite fast and also to a very high accuracy. All trajectories were computed with run\_ALADIN() and run\_ADMM().

# **7 Summary and Outlook**

This thesis aimed at designing distributed and decentralized optimization algorithms for problems with non-convex constraints. In this context, a fast convergence under limited information exchange is key.

## **Chapter 2—Basics of Distributed Optimization**

To build a foundation for algorithm development and for reviewing the literature on current distributed optimization methods, we briefly recalled distributed optimization algorithms such as the classical Alternating Direction Method of Multipliers (ADMM) and the more recent Augmented Lagrangian Alternating Direction Inexact Newton (ALADIN). We provided illustrative examples showing that classical algorithms such as ADMM might exhibit slow convergence due to the lack of constraint information in the coordination step in combination with alternating projections. We also showed that ALADIN is able to overcome these limitations by considering constraint information in its coordination step. Moreover, we emphasized that for certain non-convex problems, global convergence is out of reach—even for centralized algorithms.

## **Chapter 3—A survey on distributed optimization**

We provided a literature review on distributed optimization in Chapter 3. Here, we showed that distributed and decentralized optimization algorithms from different communities are barely able to handle the constrained non-convex problems from power systems and control that we have in mind. To this end, we categorized the literature on distributed optimization along three main lines of research: primal algorithms, primal-dual algorithms, and internal decomposition methods. The communities working on these research lines are mostly disconnected; this literature review attempted to take a more general perspective.

We showed that especially primal-dual algorithms and internal decomposition methods seem promising because of their effective constraint-handling capabilities. We classified ALADIN as a combination of primal-dual and internal decomposition methods, combining the best of these two lines of research: distribution and fast convergence for non-convex problems. We emphasized that, despite its very promising convergence properties, drawbacks of ALADIN are its high communication and coordination requirements and its lack of decentralization.

## **Chapter 4—Bi-level Distributed ALADIN**

In Chapter 4 we developed one of the first frameworks for decentralized optimization with convergence guarantees for problems with non-convex constraints. We presented two specific decentralized algorithms for distributing the coordination step of ALADIN: a decentralized variant of ADMM and a novel decentralized variant of the conjugate gradient method. Decentralized conjugate gradients have the appealing property of converging in a finite number of steps, whereas state-of-the-art algorithms such as ADMM are often guaranteed to converge at most at a linear rate, and the convergence modulus can be poor. Decentralized conjugate gradients and decentralized ADMM introduce inexactness into the coordination step of ALADIN. We showed mathematically that, despite this inexactness, the very fast local convergence properties of ALADIN can be preserved if the error in the coordination step decays fast enough.

### **Possible directions for future work**

At present, bi-level ALADIN is only guaranteed to converge locally; hence, further research on distributed and decentralized globalization routines seems important. Moreover, the amount of tuning one has to invest typically increases with the problem size. Globalization routines and internal auto-tuning routines are thus promising candidates to address these challenges.

The amount of communication and the overall convergence of bi-level ALADIN depend on the accuracy, and thus on the number of inner iterations, in the coordination step. Distributed preconditioning might help here to reduce the number of inner iterations. Similarly, dynamically adjusting the termination criterion of the coordination step might help to decrease the total communication requirements by using fewer inner iterations.

## **Chapter 5—Application to Power Systems**

In Chapter 5, we evaluated the performance of bi-level ALADIN on optimal power flow (OPF) problems with up to several hundred buses. We started by reviewing relevant literature in this context and concluded that current approaches either have no convergence guarantees for AC OPF, use oversimplified models, or use relaxations for which the solution to the original problem might not be recoverable from the relaxed problem. Other algorithms require a large amount of central coordination.

We proposed ALADIN and bi-level ALADIN as alternatives and compared their performance to ADMM, one of the most frequently used distributed optimization methods for OPF. We showed that bi-level ALADIN converges much faster than ADMM, also for a limited number of inner iterations. Moreover, we highlighted that bi-level ALADIN with conjugate gradients is able to reach about the same low communication footprint as ADMM for certain problems, while converging to a much higher accuracy.

We also showed that for bi-level ALADIN with ADMM as an inner algorithm, the achievable accuracy is often limited by the accuracy achievable by ADMM in the inner loop. This effect does not seem to have been investigated in the literature so far. Moreover, we showed that tuning the inner ADMM can be difficult, since the optimal penalization parameter changes over the outer ALADIN iterations.

In the distributed OPF literature, high penalization parameters in combination with a feasible initial point are sometimes used for ADMM. We showed numerically and mathematically that, in the extreme, this combination makes ADMM get stuck at the initial point. We also showed that pure ADMM converges only up to a limited accuracy for the problems we consider.

### **Possible directions for future work**

The considered OPF problem forms the basis for many other problems such as reactive power dispatch, state estimation, power flow, and redispatch problems [Du+19; Mur+18], all of which might benefit from distributed optimization. Also multi-stage OPF problems, where OPF problems are coupled in time via storage, might be of interest [FE19]. Broadening the view from power systems to energy systems, applications in other domains such as building control [Su+20; Zwi+19] seem promising, as there, too, limited information exchange and decentralized computation are often desirable.

Investigating the slow convergence of ADMM for certain problem classes, and particularly in the context of OPF, in more detail seems important. Especially the modulus switching from Appendix C is key, since slow convergence can be observed even for very small problems without constraints. This is relevant for bi-level ALADIN, as ADMM as the inner algorithm suffers from this slow convergence, which limits the achievable accuracy of bi-level ALADIN with ADMM for OPF.

Tuning of bi-level ALADIN becomes increasingly difficult with a growing problem size—also for problems from power systems. Hence, testing new globalization routines and internal heuristics on problems from power systems seems worth investigating.

## **Chapter 6—The ALADIN-α toolbox**

In Chapter 6, we presented ALADIN-α, one of the first general-purpose open-source toolboxes for decentralized non-convex optimization, implementing the algorithms from the preceding chapters. The toolbox enables rapid prototyping of distributed and decentralized algorithms in a modular framework. The MATLAB toolbox comes with a rich set of code examples for distributed and decentralized optimization from different engineering fields, ranging from power systems via chemical engineering to mobile sensor networks, robotics, and machine learning, highlighting its broad applicability.

We investigated the numerical performance of ALADIN in comparison with ADMM on an optimal control problem for a three-vessel chemical reactor. Moreover, ALADIN-α comes with advanced features such as parallel computing and parametric programming, enabling its use in the context of distributed model predictive control.

### **Possible directions for future work**

It seems promising to develop a generalized variant of ALADIN-α where the user can pass their own solvers and differentiation routines, eliminating the dependency on the automatic differentiation tool CasADi. This might help to speed up computation, especially for large problems.

For optimal control, a real-time implementation of ALADIN-α with code generation seems promising. An implementation in free languages such as Julia or Python also seems important to enlarge the possible user base.

# **Closing remarks**

With bi-level ALADIN, we presented one of the first families of algorithms for decentralized optimization with non-convex constraints. We showed that bi-level ALADIN is guaranteed to converge fast under a limited information exchange. We demonstrated that these properties also hold in practice for relevant problems from power systems and control. Moreover, we presented one of the first toolboxes for decentralized non-convex optimization implementing the algorithms from this thesis.

However, these are merely first steps, and the future will tell which algorithms work robustly for a broad variety of problems.

# **A Mathematical Background**

## **A.1 Mathematical Basics**

We recall some basic mathematical results, which we used in our derivations.

#### **Matrix norms**

Recall that any matrix/vector norm satisfies, for matrices $A, B \in \mathbb{R}^{m \times n}$ and scalars $c \in \mathbb{R}$,

$$\|A\| \ge 0, \; \text{with} \; \|A\| = 0 \Leftrightarrow A = 0 \quad \text{(positivity)},$$

$$\|c A\| = |c| \, \|A\| \quad \text{(homogeneity)},$$

$$\|A + B\| \le \|A\| + \|B\| \quad \text{(triangle inequality)}.$$


Note that we assume in all our derivations that matrix norms are *induced* by their corresponding vector norm, i.e. for a given vector norm $\| \cdot \|$ on $\mathbb{R}^n$, the induced matrix norm for a matrix $A \in \mathbb{R}^{m \times n}$ with respect to $\| \cdot \|$ is $\|A\| := \sup\_{x \in \mathbb{R}^n, \, \|x\| = 1} \|A x\|$. For induced matrix norms, the following property holds additionally

$$\|A B\| \le \|A\| \, \|B\| \tag{A.1} \quad \text{(submultiplicativity)}.$$

Moreover, we will need that, given a matrix $A \in \mathbb{R}^{m \times n}$ and a zero block $0 \in \mathbb{R}^{m \times p}$, we have $\| (A \;\; 0) \| = \sup\_{\| (x^\top \, y^\top)^\top \| = 1} \| (A \;\; 0) (x^\top \, y^\top)^\top \| = \sup\_{\|x\| = 1} \|A x\| = \|A\|$, where the last step follows from the fact that the choice of $y$ does not change the supremum of $\|A x\|$.

Another important property which we will use is the "triangle inequality for integrals" for integrable functions $F : \mathbb{R} \to \mathbb{R}^{n \times m}$,

$$\left\| \int\_{a}^{b} F(t)dt \right\| \leq \int\_{a}^{b} \|F(t)\| dt,\tag{A.2}$$

which can be shown by the above properties and the definition of the Riemann integral.

### **Definiteness of matrices**

A matrix $M \in \mathbb{R}^{n \times n}$ is called *positive definite* if $x^\top M x \ge 0$ for all $x \in \mathbb{R}^n$. If the above inequality holds strictly for all $x \neq 0$, $M$ is called *strictly positive definite*. Equivalently, we write $M \succeq 0$ and $M \succ 0$, respectively.

### **Fundamentals from calculus**

By the *fundamental theorem of calculus* we have, for any continuously differentiable function $g : [t\_1, t\_2] \to \mathbb{R}$,

$$\int\_{t\_1}^{t\_2} g'(t) \, dt = g(t\_2) - g(t\_1).$$

For an extension to the multivariate case, let us consider a continuously differentiable function $f : \mathbb{R}^n \to \mathbb{R}$. Define $g(t) := f(a + t(b - a))$ and thus $g'(t) = \nabla f(a + t(b - a))^\top (b - a)$. Then we have

$$\int\_0^1 \nabla f(a + t(b - a))^\top (b - a) \, dt = f(b) - f(a).$$

The extension to a vector field $F : \mathbb{R}^n \to \mathbb{R}^m$ is done by concatenation of scalar-valued functions $F\_i : \mathbb{R}^n \to \mathbb{R}$ as $F(x) := (F\_1(x), \dots, F\_m(x))^\top$, yielding

$$\int\_0^1 \nabla F(a + t(b - a))^\top (b - a) \, dt = F(b) - F(a), \qquad (\text{A.3})$$

for some $a, b \in \mathbb{R}^n$, where $\nabla F$ is the Jacobian of $F$ and the integral is evaluated component-wise.
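As a quick numerical sanity check of (A.3), the following sketch (a toy example of ours) approximates the line integral on the left-hand side with a midpoint rule and compares it to $f(b) - f(a)$:

```python
import math

def line_integral_of_gradient(grad_f, a, b, n=20000):
    """Midpoint-rule approximation of the line integral in (A.3)."""
    h = 1.0 / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * h
        x = [ai + t * (bi - ai) for ai, bi in zip(a, b)]
        g = grad_f(x)
        total += h * sum(gi * (bi - ai) for gi, bi, ai in zip(g, b, a))
    return total

# toy function f(x) = x1^2 x2 + sin(x2) with gradient (2 x1 x2, x1^2 + cos(x2))
f = lambda x: x[0] ** 2 * x[1] + math.sin(x[1])
grad_f = lambda x: [2 * x[0] * x[1], x[0] ** 2 + math.cos(x[1])]
a, b = [0.2, -1.0], [1.5, 0.7]
```

The approximation agrees with $f(b) - f(a)$ up to the quadrature error of the midpoint rule.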

#### **Convergence of sequences**

In the context of optimization algorithms, two questions are of significant interest: first, "Does the sequence generated by the algorithm converge to a minimizer/stationary point?" and, second, "If it converges, how fast does it converge to that point?"

Let us address the first question. A sequence $\{x^k\}$ in the metric space $(\mathbb{R}^n, d(x, y) = \|x - y\|)$ is called *convergent* to a point $x^\star$ if for any $\varepsilon > 0$ one can find an $N$ such that $\|x^k - x^\star\| < \varepsilon$ for all $k > N$. Note that for a sequence satisfying $\|x^{k+1} - x^\star\| \le c \, \|x^k - x^\star\|$ with $c \in [0, 1)$ one can always find such an $N$, since iterating this inequality gives $\|x^k - x^\star\| \le c^k \|x^0 - x^\star\| \le \varepsilon$ for $k$ large enough. Thus, one can show convergence to $x^\star$ by showing that the above inequality holds.

To characterize the "speed" of convergence we distinguish between four basic convergence rates (with increasing speed): Q-sublinear, Q-linear, Q-superlinear and Q-quadratic convergence. The "Q" here stands for quotient convergence. Consider a sequence $\{c\_k\}$ that converges to $c$. We say that this sequence is *Q-linearly convergent* to $c$ with modulus $\mu \in (0, 1)$ if<sup>1</sup>

$$\limsup\_{k \to \infty} \frac{|c\_{k+1} - c|}{|c\_k - c|} = \mu.$$

We say that $\{c\_k\}$ is *Q-sublinearly convergent* if the above limit equals 1 and *Q-superlinearly convergent* if it equals 0. We say that $\{c\_k\}$ converges *Q-quadratically* to $c$ if

$$\limsup\_{k \to \infty} \frac{|c\_{k+1} - c|}{|c\_k - c|^2} = \mu < \infty.$$

<sup>1</sup> Linear convergence essentially means exponential convergence in the sense that a linearly convergent sequence can be upper bounded by an exponentially convergent sequence. The name "linear" convergence comes from the fact that a linearly convergent sequence looks linear in a semi-logarithmic plot.

Figure A.1: Convergence rates in comparison.

Note that, in view of the above definition of linear convergence, if we can find a sequence satisfying $|c\_{k+1} - c| \le \mu |c\_k - c|$ with $\mu \in (0, 1)$, linear convergence with modulus $\mu$ immediately follows. Similarly, we can conclude quadratic convergence from an inequality $|c\_{k+1} - c| \le \mu |c\_k - c|^2$. Figure A.1 shows examples of sublinearly, linearly, superlinearly and quadratically convergent sequences.
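The two contraction inequalities above are easy to observe numerically; the following sketch (toy sequences of ours) builds a Q-linear and a Q-quadratic sequence and checks their characteristic ratios:

```python
# Q-linear: c_{k+1} = 0.5 * c_k (modulus 0.5); Q-quadratic: c_{k+1} = c_k^2.
lin, quad = [1.0], [0.5]
for _ in range(6):
    lin.append(0.5 * lin[-1])
    quad.append(quad[-1] ** 2)

# |c_{k+1}| / |c_k| is constant for the linear sequence ...
lin_ratios = [lin[k + 1] / lin[k] for k in range(6)]
# ... while |c_{k+1}| / |c_k|^2 is constant for the quadratic one.
quad_ratios = [quad[k + 1] / quad[k] ** 2 for k in range(6)]
```

After only six steps, the quadratic sequence is many orders of magnitude below the linear one.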

Note that there is an issue in the above definitions in case the sequence can only be upper bounded by a fast convergent sequence but does not decrease in *each* step in the above sense. This might be problematic since some optimization methods do not produce a descent in the objective function value in each step. Consider for example the sequence $c^k = e^{-k} \big(1 + \sin(10 k^2)\big)$. This sequence intuitively converges "linearly" to zero, but not in the sense of the above definition, since $\limsup\_{k \to \infty} \frac{e^{-(k+1)} (1 + \sin(10 (k+1)^2))}{e^{-k} (1 + \sin(10 k^2))} = e^{-1} \limsup\_{k \to \infty} \frac{1 + \sin(10 (k+1)^2)}{1 + \sin(10 k^2)} > 1$.

Therefore, it makes sense to use a weaker notion of convergence rate, *root* convergence (R-convergence), where convergence is characterized by a majorizing Q-convergent sequence $\{\bar{c}^k\}$. In view of that, we say that the sequence $\{c^k\}$ *converges R-*{*linearly, superlinearly,*

Figure A.2: Q-convergence vs. R-convergence.

*quadratically*} to $c$ if there exists a Q-{linearly, superlinearly, quadratically} convergent sequence $\{\bar{c}^k\}$ converging to zero such that

$$|c^k - c| \le \bar{c}^k.$$

For the above example we can define the Q-linearly convergent sequence $\bar{c}^k = 2 e^{-k} \ge e^{-k} \big(1 + \sin(10 k^2)\big)$, and thus R-linear convergence of $\{c^k\}$ follows. Figure A.2 illustrates the above example. Note that trivially each Q-linearly convergent sequence is R-linearly convergent, but the converse is not true.
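The distinction between Q- and R-convergence can be reproduced numerically with a simple oscillating sequence (a variant of ours, with constants chosen purely for illustration): it decays like $e^{-k}$ overall and is majorized by $2 e^{-k}$, hence R-linear, yet its step-to-step quotient exceeds one infinitely often, so it is not Q-linear:

```python
import math

# Decays overall, but every second step the quotient c_{k+1}/c_k exceeds 1.
c = [math.exp(-k) * (2.0 if k % 2 == 0 else 0.5) for k in range(20)]
ratios = [c[k + 1] / c[k] for k in range(19)]

# Q-linearly convergent majorant: c_k <= 2 e^{-k} for all k.
majorant = [2.0 * math.exp(-k) for k in range(20)]
```

The quotients alternate between roughly 0.09 and 1.47, so no modulus in (0, 1) bounds them, while the majorant certifies R-linear convergence.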


Moreover, there exist several ways of characterizing the convergence of optimization methods. Convergence can be characterized in terms of the distance to a local minimizer (or a stationary point), i.e. $\|x^k - x^\star\| = \mathcal{O}(\cdot)$, in terms of the objective function value, $|f(x^k) - f(x^\star)| = \mathcal{O}(\cdot)$, or in terms of stationarity of the objective function (in the unconstrained case), $\|\nabla f(x^k)\| = \mathcal{O}(\cdot)$. In the present work, we refer to the first case, i.e. convergence estimates considering the distance to a local minimizer, if not stated differently. Note, however, that particularly in the context of distributed optimization, convergence is often defined in terms of the objective function value. Under certain conditions such as Lipschitz continuity or strong convexity, one can transfer estimates in terms of the objective function to a result in terms of the distance to a local minimizer [Ber99, Chap 1.2] [Bec17], but not always. Sometimes convergence results are stated in terms of a specific accuracy, i.e. the number of iterations needed to reach a specific accuracy in the objective function value, $|f(x^k) - f(x^\star)| < \varepsilon$ [Bec+18], which we will not use in the present work.

## **A.2 Distributed problem formulations**

Here, we give a brief overview of common problem formulations used in the literature. Recall that in the present work we consider problems in the form of (2.12),

$$\min\_{\mathbf{x}\_1, \dots, \mathbf{x}\_R} \sum\_{i \in \mathcal{R}} f\_i(\mathbf{x}\_i) \tag{A.4a}$$

$$\text{subject to} \qquad \mathbf{g}\_i(\mathbf{x}\_i) = \mathbf{0}, \qquad \forall i \in \mathcal{R}, \tag{A.4b}$$

$$h\_i(\mathbf{x}\_i) \le \mathbf{0}, \qquad \forall i \in \mathcal{R}, \tag{A.4c}$$

$$\sum\_{i \in \mathcal{R}} A\_i \mathbf{x}\_i = b, \tag{A.4d}$$

where $f\_i$, $g\_i$ and $h\_i$ are smooth. A common technique to simplify problem (A.4) is to replace $f\_i$ by $\tilde{f}\_i := f\_i + \iota\_{\mathbb{X}\_i}$, where $\iota\_{\mathbb{X}\_i}$ is the indicator function of the set $\mathbb{X}\_i := \{x \in \mathbb{R}^{n\_i} \mid g\_i(x) = 0, \, h\_i(x) \le 0\}$, cf. Section 2.2.1. This yields the equivalent problem

$$\min\_{\mathbf{x}\_1, \dots, \mathbf{x}\_N} \sum\_{i=1}^N \tilde{f}\_i(\mathbf{x}\_i) \tag{A.5a}$$

$$\text{subject to} \quad \sum\_{i \in \mathcal{R}} A\_i \mathbf{x}\_i = b. \tag{A.5b}$$

Note that if all $g\_i$ are affine and all $h\_i$ are convex, all $\tilde{f}\_i$ are convex but possibly non-differentiable and discontinuous functions, which renders derivative-based algorithms such as gradient-based methods or SQP methods non-applicable. However, some algorithms such as ADMM can handle such discontinuities under certain circumstances. By introducing auxiliary variables $z\_i \in \mathbb{R}^{n\_i}$, we can write (A.5) as

$$\min\_{\boldsymbol{x}, \boldsymbol{z}} \tilde{f}(\boldsymbol{x}) + \iota\_C(\boldsymbol{z})$$
 
$$\text{subject to } \boldsymbol{x}\_i - \boldsymbol{z}\_i = \boldsymbol{0}, \qquad \forall i \in \mathcal{R}$$

with $\tilde{f} = \sum\_{i \in \mathcal{R}} \tilde{f}\_i$ and where $C = \{z \in \mathbb{R}^{n} \mid \sum\_{i \in \mathcal{R}} A\_i z\_i - b = 0\}$. This problem is now in the famous *two-block* form

$$\min\_{\mathbf{x}, z} f(\mathbf{x}) + \mathbf{g}(z) \tag{A.6}$$

$$\text{subject to } A\boldsymbol{x} + B\boldsymbol{z} = \boldsymbol{c}, \tag{A.7}$$

used in many contexts of distributed optimization [BT89, Chap 3.4], [Bec17, Chap 10], [PB14; EB92; Och+14; BT09; DLS16; GB16; Boy+11]. This form is often used in so-called *operator splitting* schemes, which take a different, operator-theoretic view on distributed optimization [EB92; GOY17; BC11]. Note that ADMM has been shown to be a special case of the so-called *Douglas-Rachford* splitting algorithm [Gab83]. Generally, splitting schemes can often efficiently exploit the fact that one of the two functions $f$ or $g$ is separable, which is also the case here for $\tilde{f}$. Note that these splitting schemes often rely on convexity, which can be ensured by choosing convex $f\_i$ and $h\_i$ and affine $g\_i$.
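As a minimal runnable illustration of the two-block form (a scalar toy problem of ours, not taken from the references): ADMM applied to $\min \tfrac12 (x - a)^2 + \tfrac12 (z - b)^2$ subject to $x - z = 0$, whose minimizer is $x = z = (a + b)/2$.

```python
def admm_two_block(a, b, rho=1.0, iters=100):
    """ADMM for min 0.5(x-a)^2 + 0.5(z-b)^2  s.t.  x - z = 0.
    Alternates the two block minimizations of the augmented Lagrangian
    and a dual ascent step on the coupling constraint."""
    x = z = lam = 0.0
    for _ in range(iters):
        x = (a - lam + rho * z) / (1.0 + rho)   # x-minimization
        z = (b + lam + rho * x) / (1.0 + rho)   # z-minimization
        lam += rho * (x - z)                    # dual update on x - z = 0
    return x, z
```

For this strongly convex problem, the iterates approach the consensus solution $(a + b)/2$ geometrically.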

#### **Consensus problems**

In many cases, distributed optimization approaches consider unconstrained problems in form of

$$\min\_{\mathbf{x}} \sum\_{i \in \mathcal{R}} f\_i(\mathbf{x}). \tag{A.8}$$

in the literature [Shi+14; Shi+15b; NO09; MO17; DLS16; NOP10; Boy+11; ZK14; NOS17]. Here, the $f\_i$ are typically continuous but might not be differentiable, i.e. they might contain non-smooth components, but they typically do not encode constraints via the indicator function as in the previous subsection. Note that here all $f\_i$ depend on *one common* decision vector $x$. Similar to the above, by introducing copies $z\_i \in \mathbb{R}^n$ of $x$, it is possible to transform (A.8) into the so-called *consensus form*

$$\min\_{\mathbf{x}, z\_i} \sum\_{i \in \mathcal{R}} f\_i(z\_i) \tag{A.9a}$$

$$\text{subject to } z\_i = \mathbf{x} \quad \text{for all} \quad i \in \mathcal{R} \tag{A.9b}$$

which is again in two-block form (A.6). Note that in this formulation, the network structure is not explicitly considered, which makes this form interesting mostly in parallel computing contexts. Next, we review problem structures that do consider the network structure.

#### **Connection to optimization over networks**

Many works consider distributed optimization over networks [DLS16; Shi+14; Shi+15b; NO09; TT17]. Therein, problem (A.8) is typically reformulated as a consensus problem subject to certain communication constraints, which are encoded as a graph $\mathcal{G} = (\mathcal{N}, \mathcal{E})$. One option for doing so [Shi+14] is to introduce global auxiliary variables $z\_{ij}$ only for the edges $(i, j) \in \mathcal{E}$, yielding

$$\min\_{\mathbf{x}\_i, \mathbf{z}\_{ij}} \sum\_{i \in \mathcal{R}} f\_i(\mathbf{x}\_i) \tag{A.10}$$

$$\text{subject to } \mathbf{x}\_i = \mathbf{z}\_{ij}, \ \mathbf{x}\_j = \mathbf{z}\_{ij} \text{ for all } (i, j) \in \mathcal{E}. \tag{A.11}$$

Note that (A.10) is again in two-block form (A.6). An alternative option is copying all variables and enforcing consensus directly via characteristic matrices of $\mathcal{G}$ such as the Laplacian matrix $L \in \mathbb{R}^{|\mathcal{N}| \times |\mathcal{N}|}$ or the incidence matrix $E \in \mathbb{R}^{|\mathcal{N}| \times |\mathcal{E}|}$ [Dor18] for one-dimensional problems. This yields

$$\min\_{\mathbf{x}\_i} \sum\_{i \in \mathcal{R}} f\_i(\mathbf{x}\_i) \tag{A.12}$$

subject to $L \mathbf{x} = \mathbf{0}$ or $E^\top \mathbf{x} = \mathbf{0}$. (A.13)

Applying ADMM or a gradient/subgradient method reveals that it suffices to exchange information only between neighbors, i.e. over the edges $(i, j) \in \mathcal{E}$. Note that due to these reformulations, properties of $\mathcal{G}$ such as connectedness or algebraic connectivity influence the convergence properties of the resulting decentralized algorithms, where stronger connectivity typically leads to faster convergence [Shi+14].
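A minimal sketch of this Laplacian-based encoding (a pure-Python example of ours on a three-node path graph): $L x = 0$ holds exactly for constant, i.e. consensus, vectors on a connected graph.

```python
def laplacian(n_nodes, edges):
    """Graph Laplacian L = D - A as a nested list."""
    L = [[0] * n_nodes for _ in range(n_nodes)]
    for i, j in edges:
        L[i][i] += 1
        L[j][j] += 1
        L[i][j] -= 1
        L[j][i] -= 1
    return L

def apply(L, x):
    """Matrix-vector product L x."""
    return [sum(Lij * xj for Lij, xj in zip(row, x)) for row in L]

# Path graph 0 - 1 - 2: connected, so Lx = 0 iff x is constant.
L = laplacian(3, [(0, 1), (1, 2)])
```

Evaluating `apply(L, x)` for a constant vector yields zero, while any non-consensus vector produces a nonzero residual.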

Other algorithms for (A.8) are developed without explicitly introducing the graph properties into the problem formulation [NO09; Shi+15b; NOS17; DLS16; YLY16]. In these works, matrices encoding connectivity information are typically introduced directly into the iteration schemes, e.g. of a gradient method. Recent overviews on optimization over networks can be found in [NOW18; NOR18]; cf. [Dor18] for an introduction from a control-theoretic perspective.

#### **The here-used connectivity encoding technique**

In the present work we use a network encoding which is most closely related to the edge-based technique from the previous paragraph. Our formulation is more general than the before-introduced techniques in the sense that it allows for multiple decision variables per node, where all these decision vectors can be of different dimensions. This already shows that in the context of our work we mainly rely on "complexity in the nodes" in the sense that we usually have few nodes, each of which has a quite complex subproblem. Rather than defining the graph a priori, we follow a different route: we define our communication graph based on the sparsity of the consensus matrices $\{A\_i\}$. In view of (A.4), we have a node set $\mathcal{R}$, to each element of which we assign a decision vector $\mathbf{x}\_i$, for all $i \in \mathcal{R}$. We define the edges of the graph via the non-zero entries in $\{A\_i\}$: two subsystems $i, j \in \mathcal{R}$ are called *neighbored* (and thus we assign an edge $(i, j)$ to the set $\mathcal{E}$) if there is a row of the consensus constraint (A.4d) with non-zero entries in both corresponding matrices $A\_i$ and $A\_j$. This way, our formulation is more general compared to the other formulations above, since we allow for multiple variables per node and for multiple interconnections between two nodes.

Moreover, other couplings exist, such as objective coupling, where the coupling between variables takes place in the objective function. However, such formulations can typically be recast in constraint-coupled form. We refer to [NND11] for a more detailed treatment in the context of control.

## **A.3 Consensus reformulation preserves LICQ**

In order to use distributed optimization methods, it is often necessary to reformulate problems in the affinely-coupled separable form (2.12), as we have seen for OPF in Section B.4. One approach for doing so is copying decision variables which couple the subsystems nonlinearly and coupling them again by additional affine equality constraints, which immediately leads to problems in the form of (2.12). In this case, an important question is whether or not we "destroy" certain nice properties of the original problem, such as the constraint qualifications from Definition 2. Next, we show that this is not the case, i.e. that copying variables and adding additional affine coupling constraints preserves LICQ.

Let us assume we have an optimization problem with two blocks of variables $\mathbf{x} \in \mathbb{R}^{n\_x}$, $\mathbf{y} \in \mathbb{R}^{n\_y}$ subject to equality constraints $\mathbf{g} : \mathbb{R}^{n\_x} \times \mathbb{R}^{n\_y} \to \mathbb{R}^{n\_g}$ with $n\_g \le n\_x + n\_y$. We now want to show that introducing auxiliary variables $\mathbf{z} \in \mathbb{R}^{n\_y}$ preserves LICQ.

If LICQ holds for $\mathbf{g}$, we have

$$\text{rank}\left(\nabla \mathbf{g}\left(\mathbf{x}, \mathbf{y}\right)\right) = n\_g$$

in a small neighborhood of $(\mathbf{x}^\star, \mathbf{y}^\star)$. If we minimize over the extended set of variables $(\mathbf{x}, \mathbf{y}, \mathbf{z}) \in \mathbb{R}^{n\_x + 2 n\_y}$, the constraints are given by

$$\tilde{\mathbf{g}}(\mathbf{x}, \mathbf{y}, z) := \begin{pmatrix} \mathbf{g}(\mathbf{x}, z) \\ \mathbf{y} - z \end{pmatrix} = \mathbf{0}.$$

Moreover

$$
\nabla \tilde{\mathbf{g}}(\mathbf{x}, \mathbf{y}, \mathbf{z}) = \begin{pmatrix}
\nabla\_x \mathbf{g}(\mathbf{x}, \mathbf{z}) & 0 & \nabla\_z \mathbf{g}(\mathbf{x}, \mathbf{z}) \\
0 & I & -I
\end{pmatrix}.
\tag{A.14}$$

By LICQ, the first block row of (A.14) has rank $n\_g$. Using the identity

$$\text{rank} \begin{pmatrix} A \\ B \end{pmatrix} = \text{rank}(A) + \text{rank}\left(B - B A^{+} A\right),$$

where $A^{+}$ denotes the Moore-Penrose pseudoinverse of $A$, then yields $\text{rank}\left(\nabla \tilde{\mathbf{g}}(\mathbf{x}, \mathbf{y}, \mathbf{z})\right) = n\_g + n\_y$. This shows that LICQ is preserved.
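This rank argument can be spot-checked numerically (a random toy instance of ours; the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny, ng = 4, 3, 2

# Random Jacobian blocks of g(x, z); generically of full row rank n_g (LICQ).
Gx = rng.standard_normal((ng, nx))
Gz = rng.standard_normal((ng, ny))

# Jacobian (A.14) of the lifted constraints (g(x, z); y - z).
top = np.hstack([Gx, np.zeros((ng, ny)), Gz])
bottom = np.hstack([np.zeros((ny, nx)), np.eye(ny), -np.eye(ny)])
J = np.vstack([top, bottom])
```

The lifted Jacobian has rank $n\_g + n\_y$ whenever the original Jacobian has full row rank $n\_g$, matching the claim above.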

## **A.4 Proof of Theorem 4**

The proof is similar to standard proofs for Newton-type methods from [Die16; NW06; Ber99] and is given here for the sake of a self-contained presentation and because of its importance for the convergence analysis of ALADIN. The idea is to estimate the distance to a local primal-dual solution $q^\star$ of (2.12). With the Newton-type iteration (2.7) we get

$$\begin{split} \|q^{k+1} - q^{\star}\| &= \|q^k - q^{\star} - (M^k)^{-1} F(q^k)\| \\ &= \|q^k - q^{\star} - (M^k)^{-1} (F(q^k) - F(q^{\star}))\| \\ &= \|(M^k)^{-1} M^k (q^k - q^{\star}) \\ &\quad - (M^k)^{-1} \int\_0^1 \nabla F(q^{\star} + t(q^k - q^{\star})) (q^k - q^{\star}) dt \| \\ &= \|(M^k)^{-1} (M^k - \nabla F(q^k)) (q^k - q^{\star}) \\ &\quad - \int\_0^1 (M^k)^{-1} \left( \nabla F(q^{\star} + t(q^k - q^{\star})) - \nabla F(q^k) \right) (q^k - q^{\star}) dt \| \\ &\leq \left\|(M^k)^{-1} (M^k - \nabla F(q^k)) \right\| \left\| q^k - q^{\star} \right\| \\ &\quad + \int\_0^1 \left\| (M^k)^{-1} \left( \nabla F(q^{\star} + t(q^k - q^{\star})) - \nabla F(q^k) \right) \right\| dt \, \left\| q^k - q^{\star} \right\| \\ &\leq \kappa \left\| q^k - q^{\star} \right\| + \int\_0^1 \omega (1 - t) \left\| q^k - q^{\star} \right\| dt \, \left\| q^k - q^{\star} \right\| \\ &= \left( \kappa + \frac{\omega}{2} \| q^k - q^{\star} \| \right) \left\| q^k - q^{\star} \right\|, \end{split}$$

where we used (in this order) $F(q^\star) = 0$, the fundamental theorem of calculus (A.3), continuous differentiability of $F$ and adding $\pm (M^k)^{-1} \nabla F(q^k)(q^k - q^\star)$ and pulling it into the integral, (in one step) the triangle inequality—(A.2)—submultiplicativity—and standard properties of integrals, the Lipschitz condition (2.10a) and compatibility condition (2.10b) and the positivity of norms, and finally evaluation of the integral. Convergence follows by inserting $\|q^k - q^\star\| < \frac{2(1-\kappa)}{\omega}$ into the contraction estimate, leading to $\|q^{k+1} - q^\star\| < \|q^k - q^\star\|$, which is sufficient for convergence to $q^\star$. □
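The contraction estimate can also be observed numerically. The following sketch (our own toy problem, not from the thesis) runs the Newton-type iteration $q^{k+1} = q^k - (M^k)^{-1} F(q^k)$ on the scalar function $F(q) = q + q^3$ with root $q^\star = 0$: with the exact Jacobian we have $\kappa = 0$ and quadratic convergence, while freezing the Jacobian at $q^0$ gives $\kappa > 0$ and linear convergence.

```python
# Scalar test problem: F(q) = q + q^3 with unique root q* = 0.
F = lambda q: q + q ** 3
dF = lambda q: 1 + 3 * q ** 2

def newton_type(M, q0, iters=20):
    """Newton-type iteration q+ = q - M(q)^{-1} F(q); returns the error history."""
    q, errs = q0, []
    for _ in range(iters):
        q = q - F(q) / M(q)
        errs.append(abs(q))  # distance to q* = 0
    return errs

err_exact = newton_type(dF, 0.5)                  # M^k = grad F(q^k): kappa = 0
err_frozen = newton_type(lambda q: dF(0.5), 0.5)  # M^k frozen at q^0: kappa > 0
# near q* the frozen-Jacobian errors contract linearly with modulus
# 1 - dF(0)/dF(0.5) = 3/7, in line with the estimate above
```

The exact-Jacobian run reaches machine precision within a handful of steps, while the frozen-Jacobian run shows the linear rate predicted by the theorem.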

# **B Power System Fundamentals**

Using a sufficiently accurate grid model is extremely important. In this section, we provide a brief introduction to the *AC power system model*, which provides sufficient accuracy for decision making on a seconds-to-hours time scale. This model describes the nonlinear relation between the injected/consumed powers at all nodes and the corresponding voltages. An often-employed alternative is the so-called *DC model*, which is popular in the context of power system optimization.<sup>1</sup> Note that this model is often insufficient due to the lack of voltage and reactive power modeling.

## **B.1 The AC model**

Let us make the assumptions we use for the AC model explicit. We assume


<sup>1</sup> The DC model assumes constant voltages, no resistances in transmission lines and small voltage angles, yielding a *linear model* of power systems that makes all benefits of convex optimization available. However, while these assumptions represent the physical laws in transmission grids quite well, in distribution grids they are often violated to a large degree, for example due to high resistances and high power injections from renewables at weak nodes, leading to quite large voltage angle differences [WW13].

Assumption (i) implies that we have a steady state and transient phenomena can be neglected; assumption (i) together with assumption (ii) allows for using a π-equivalent model of transmission lines/cables. Assumption (iii) is mainly made for simplicity of presentation; including shunts and transformers poses no significant difficulty from an algorithmic perspective—however, it somewhat complicates the model derivation. Assumption (iv) implies that we can represent all three phases by their single-phase equivalent. Assumption (v) is made to avoid the difficulty of handling different load models, which is quite common for the here-considered steady-state problems [Ari+18].

*Remark 20 (Discussion of the above assumptions)* Finer-grained models exist which relax the above assumptions. Examples include using hyperbolic equations for the transmission lines (or even partial differential equations) [GS94, Chap 6], using three-phase models for unbalanced grids, using voltage-dependent demands [Ari+18], considering voltage-regulated buses, and considering transformers and line shunts. However, there is always a trade-off between sufficiently accurate modeling and the complexity of the resulting problem. One usually aims at using a model which captures the relevant effects to sufficient accuracy, but neglects further effects in order to keep the resulting optimization problems tractable. With the set of assumptions given here, we are still able to handle voltage bounds and reactive power.

#### **The AC power flow equations**

We model an electrical grid by an undirected graph $G = (\mathcal{N}, \mathcal{E}, Y)$, where $\mathcal{N} = \{1, \dots, N\}$ is the set of buses, $\mathcal{E} \subseteq \mathcal{N} \times \mathcal{N}$ is the set of branches and $Y : \mathcal{E} \to \mathbb{C}^2$, $(k, l) \mapsto (y_{kl}, y_{kl}^s)$ is a map assigning a series admittance $y_{kl}$ and a shunt admittance $y_{kl}^s$ to each branch. We use these admittances in the π-branch model shown in Figure B.1. In order to derive the relationships between bus powers and bus voltages, we investigate the relationship between the complex powers flowing into/out of the branch, $s_{kl}, s_{lk} \in \mathbb{C}$, and the voltage phasors at the beginning/end of the branch, $u_k \in \mathbb{C}$ and $u_l \in \mathbb{C}$, by means of circuit theory.

The complex power flowing from node $k$ to node $l$ is $s_{kl} = u_k i_{kl}^*$, where $i_{kl} \in \mathbb{C}$ is the complex current over the branch $(k, l)$ and $(\cdot)^*$ denotes the

Figure B.1: π-line model of a branch.

complex conjugate of a complex number. Furthermore, we have the series current $\bar{i}_{kl} = (u_k - u_l)\, y_{kl}$ and the shunt current $\bar{i}_{kl}^s = u_k\, \frac{y_{kl}^s}{2}$. Thus,

$$s_{kl} = u_k \left( (u_k^* - u_l^*)\, y_{kl}^* + u_k^* \frac{y_{kl}^{s*}}{2} \right). \tag{B.1}$$

By the law of energy conservation, the complex (net) power at node $k$, $s_k$, has to be the sum of all powers flowing into connected branches, yielding

$$s_k = \sum_{l=1}^{N} u_k \left( (u_k^* - u_l^*)\, y_{kl}^* + u_k^* \frac{y_{kl}^{s*}}{2} \right) \quad \text{for all} \quad k \in \mathcal{N}. \tag{B.2}$$

By defining a *bus admittance matrix*

$$[Y]_{kl} := \begin{cases} \sum_{m=1}^N \left( y_{km} + \frac{y_{km}^s}{2} \right), & \text{if} \quad k = l, \\ -y_{kl}, & \text{if} \quad k \neq l, \end{cases}$$

we can write (B.2) compactly as

$$s = \operatorname{diag}(u)\, Y^* u^*. \tag{B.3}$$

Here, $u = (u_k)_{k=1,\dots,N} \in \mathbb{C}^N$ is the vector of all nodal voltages, $s = (s_k)_{k=1,\dots,N} \in \mathbb{C}^N$ denotes the vector of all complex power injections and we define $\operatorname{diag}(z) \in \mathbb{C}^{n_z \times n_z}$ as a diagonal matrix with the entries of $z \in \mathbb{C}^{n_z}$ on the main diagonal. If multiple components such as loads, generators and storages are connected to one node, we have

$$s_k = s_k^{\mathrm{g}} - s_k^{\mathrm{d}} - s_k^{\mathrm{s}} \quad \text{for all} \quad k \in \mathcal{N}, \tag{B.4}$$


Figure B.2: Apparent power balance at node $k$.

where $s_k^{\mathrm{g}} \in \mathbb{C}$ denotes the apparent power of a generator, $s_k^{\mathrm{d}} \in \mathbb{C}$ denotes the apparent power demand and $s_k^{\mathrm{s}} \in \mathbb{C}$ denotes the apparent power of a storage, all connected to bus $k$. If there is no generator/demand/storage at a node, we set the corresponding power to zero.

As numerical algorithms are typically designed for real-valued spaces, we have to transform (B.3) to the real domain. Here, we have different possibilities: we can represent the complex voltages at all nodes $k = 1, \dots, N$ in Cartesian coordinates $u_k = u_k^{\mathrm{re}} + j u_k^{\mathrm{im}}$ or in polar coordinates $u_k = v_k e^{j\theta_k}$, where $v_k$ is the voltage magnitude and $\theta_k$ is the voltage angle at node $k$. We have the same choice for the entries of the bus admittance matrix. The choice of these coordinates leads to different forms of the power flow equations—for details we refer to the excellent survey [FR16]. Here we use the combination of polar coordinates for $u_k$ and rectangular coordinates $Y_{kl} = G_{kl} + jB_{kl}$, which is the most popular formulation for OPF. By doing so and by taking real and imaginary parts of (B.3) we get

$$\operatorname{Re}(s_k) = p_k = v_k \sum_{l=1}^{N} v_l \left( G_{kl} \cos(\theta_{kl}) + B_{kl} \sin(\theta_{kl}) \right), \tag{B.5a}$$

$$\operatorname{Im}(s_k) = q_k = v_k \sum_{l=1}^{N} v_l \left( G_{kl} \sin(\theta_{kl}) - B_{kl} \cos(\theta_{kl}) \right), \tag{B.5b}$$

for all nodes $k \in \mathcal{N}$, where $\theta_{kl} := \theta_k - \theta_l$ and where $p_k$ is called the *active* power and $q_k$ is called the *reactive* power at node $k$. Note that $p_k$ and $q_k$ are *net powers* (i.e. the residual between generation and demand at node $k$); substituting $p_k$ and $q_k$ in (B.5) with the corresponding real/imaginary part of equation (B.4) yields power flow equations in terms of component powers.
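As a sanity check, the polar equations (B.5) can be verified against the complex form (B.3) numerically. The following sketch uses purely illustrative admittances and voltages and omits shunt elements:

```python
import numpy as np

# Toy 3-bus grid with series admittances only (illustrative values).
N = 3
lines = {(0, 1): 1 - 4j, (1, 2): 0.5 - 2j, (0, 2): 1 - 3j}

# Bus admittance matrix (without shunt terms).
Y = np.zeros((N, N), dtype=complex)
for (k, l), y_kl in lines.items():
    Y[k, k] += y_kl
    Y[l, l] += y_kl
    Y[k, l] -= y_kl
    Y[l, k] -= y_kl

v = np.array([1.02, 0.99, 1.01])       # voltage magnitudes
th = np.array([0.0, -0.02, 0.015])     # voltage angles
u = v * np.exp(1j * th)                # voltage phasors

s = np.diag(u) @ Y.conj() @ u.conj()   # complex injections via (B.3)

# Polar power flow equations (B.5a)/(B.5b), vectorized over all buses.
D = th[:, None] - th[None, :]          # matrix of angle differences theta_kl
p = v * ((Y.real * np.cos(D) + Y.imag * np.sin(D)) @ v)
q = v * ((Y.real * np.sin(D) - Y.imag * np.cos(D)) @ v)
```

Both forms give identical net injections, confirming the coordinate transformation.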

## **B.2 Power flow analysis**

Recall that classically the only quantities which we can directly influence in power systems are the active and reactive power injections from generators, $p_k^{\mathrm{g}}$ and $q_k^{\mathrm{g}}$. But how can we compute the voltage angles and voltage magnitudes for given power injections and demands? This question is the subject of so-called *power flow* analysis.

As the power flow equations (B.5) are usually not algebraically solvable for $(\theta, v)$, we have to use iterative methods. Let us write the power flow equations (B.5) with (B.4) as

$$F_k(\chi_k) := \begin{pmatrix} p_k^{\mathrm{g}} - p_k^{\mathrm{d}} - p_k^{\mathrm{s}} - v_k \sum_{l=1}^{N} v_l (G_{kl}\cos(\theta_{kl}) + B_{kl}\sin(\theta_{kl})) \\ q_k^{\mathrm{g}} - q_k^{\mathrm{d}} - q_k^{\mathrm{s}} - v_k \sum_{l=1}^{N} v_l (G_{kl}\sin(\theta_{kl}) - B_{kl}\cos(\theta_{kl})) \end{pmatrix} = 0$$

for all nodes except the first node, $k = 2, \dots, N$, with $\chi_k := (\theta_k, v_k)^\top$. Let us assume that the variables $p_k^{\mathrm{g}}$, $q_k^{\mathrm{g}}$, $p_k^{\mathrm{d}}$, $p_k^{\mathrm{s}}$, $q_k^{\mathrm{d}}$ and $q_k^{\mathrm{s}}$ are fixed and given here. This yields an implicit function $\tilde{F}(z) := (F_k(\chi_k))_{k=2,\dots,N}$, where $z := (\chi_k)_{k=2,\dots,N}$.

Bus one is not included in $\tilde{F}$ and has a special role here: technically speaking, this bus has to compensate the power demand (active and reactive power) which is not covered by the other generators. Physically, this follows from the law of energy conservation. Mathematically, this follows from the fact that the last power flow equation can be expressed as a linear combination of the other power flow equations (if no shunts are present, as we assume here), implying that one of the power flow equations is "redundant". This leads to the situation that we cannot apply the implicit function theorem for locally inverting $\tilde{F}$ (because of the singularity of $\nabla \tilde{F}$). Hence, also iterative methods such as Newton-type methods would fail. The common approach for tackling this issue is to introduce a so-called *slack bus* at a bus with a large generator (typically bus one), where instead of fixing the net powers $p_1$ and $q_1$, we fix the voltage angle $\theta_1$ and voltage magnitude $v_1$. Hence, the active and reactive powers become "free variables" at this node in the sense that they are a result rather than a precondition of our computation.<sup>2</sup> Thus, the extended implicit function reads

$$F(x) := \begin{pmatrix} v_1 - v_1^{\mathrm{ref}}, & \theta_1 - \theta_1^{\mathrm{ref}}, & (F_k(\chi_k))_{k=2,\dots,N} \end{pmatrix}^\top \stackrel{!}{=} 0, \tag{B.6}$$

with $x := (v_1, \theta_1, (\theta_k, v_k)_{k=2,\dots,N})$, where $v_1^{\mathrm{ref}}$ and $\theta_1^{\mathrm{ref}}$ are constant and given reference values. Note that the components of $F$ become linearly independent and $\nabla F(x)$ becomes invertible. The above equations can now be solved by standard Newton-type methods from Section 2.1.2, which is then called a *power-flow study*.
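To make this concrete, the following sketch performs such a power-flow study on an illustrative 2-bus system: bus 1 is the slack bus with fixed $v_1, \theta_1$, bus 2 is a pq-bus whose injections are generated from a known solution; the Jacobian is approximated by finite differences (all data invented for illustration):

```python
import numpy as np

# 2-bus example: bus 1 slack (v1, theta1 fixed), bus 2 pq-bus.
y12 = 1.0 - 5.0j                       # illustrative line admittance
Y = np.array([[y12, -y12], [-y12, y12]])
G, B = Y.real, Y.imag
v1, th1 = 1.0, 0.0

def residual(x, target):
    """Mismatch of (p2, q2) at bus 2, cf. F_k in (B.6); x = (theta2, v2)."""
    th = np.array([th1, x[0]])
    v = np.array([v1, x[1]])
    p2 = v[1] * sum(v[l] * (G[1, l] * np.cos(th[1] - th[l])
                            + B[1, l] * np.sin(th[1] - th[l])) for l in range(2))
    q2 = v[1] * sum(v[l] * (G[1, l] * np.sin(th[1] - th[l])
                            - B[1, l] * np.cos(th[1] - th[l])) for l in range(2))
    return np.array([p2, q2]) - target

# Injections consistent with the known solution (theta2, v2) = (-0.03, 0.98).
target = residual(np.array([-0.03, 0.98]), np.zeros(2))

x = np.array([0.0, 1.0])               # flat start
for _ in range(20):
    Fx = residual(x, target)
    J = np.zeros((2, 2))               # finite-difference Jacobian
    for j in range(2):
        e = np.zeros(2); e[j] = 1e-7
        J[:, j] = (residual(x + e, target) - Fx) / 1e-7
    x = x - np.linalg.solve(J, Fx)     # Newton-type step
```

From the flat start the iteration recovers the known high-voltage solution.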

*Remark 21 (Different bus types)* Note that we assume here for simplicity that we only have two bus types: "pq-buses", where active and reactive power are given, and a slack bus, where the voltage angle and magnitude are given. In many works, a third bus type is also considered ("pv-buses"), where the active power and the voltage magnitude are given and the voltage angle and the reactive power are unknown. This comes from the fact that some generators are equipped with voltage regulators, which keep the voltage magnitude at a certain value by injecting reactive power. We leave this out for simplicity here, as our goal is to outline the main idea of power flow computations, and we refer to [GS94; WW13] for further details.

## **B.3 Optimal power flow**

The classical goal of *optimal power flow* is to compute (in a certain sense) optimal set-points for all controllable devices in a grid such that a certain cost function is minimized. Thereby, technical and physical limitations, such as power generation limits, as well as the power flow equations (B.5) are considered.

<sup>2</sup> This assumption is technically sound if a large generator is connected to the first bus which can typically be ensured by an appropriate numbering of the buses.

In many cases, the objective function in OPF is the sum of the cost functions for active power generation over all generators [Zhu15; FR16]

$$f(x) := \sum_{k=1}^{N} f_k^{\mathrm{g}}(p_k^{\mathrm{g}}),$$

where we have $x := (\chi_k)_{k=1,\dots,N}$ with $\chi_k := (\theta_k, v_k, p_k^{\mathrm{g}}, q_k^{\mathrm{g}})^\top$ and $p_k^{\mathrm{d}}$, $p_k^{\mathrm{s}}$, $q_k^{\mathrm{d}}$, $q_k^{\mathrm{s}}$ fixed and given. In contrast to the power flow problem from the previous subsection, all generator powers are "free" variables now, i.e. we would like to compute them optimally. The individual generator cost functions are often chosen as quadratic functions

$$f_k^{\mathrm{g}}(p_k^{\mathrm{g}}) := a_k \left(p_k^{\mathrm{g}}\right)^2 + b_k\, p_k^{\mathrm{g}} + c_k$$

for all $k = 1, \dots, N$ with $f_k^{\mathrm{g}} \equiv 0$ for non-generator buses. Note that the generator cost generally does *not* depend on the reactive power injection $q_k^{\mathrm{g}}$. Reactive power can be seen as a power charging and discharging transmission lines and inductive/capacitive loads—this energy oscillates between consumers and the grid and is thus *not* related to any generation cost (except for slightly changing grid losses).

Power generators are usually subject to active and reactive power limitations. These limitations come from operation limits of the power generation units and, for the reactive power, mainly from current limitations in the generator stator winding. Moreover, voltage magnitudes have to stay within certain bounds to ensure proper functioning of the connected devices and to avoid damaging the electrical insulation. The current in transmission lines is also constrained by thermal limits. This is often expressed (although not entirely correctly)<sup>3</sup> by limits on the magnitude of the apparent power over lines

$$|s_{kl}(x)| := \sqrt{p_{kl}(x)^2 + q_{kl}(x)^2}, \tag{B.7}$$

<sup>3</sup> To be more precise, one would have to limit the magnitude of the current over the transmission line, $|i_{kl}|$. As voltage magnitudes typically deviate only slightly from the nominal voltage, the approximation by an apparent power limit is sufficiently tight in most cases.

where $p_{kl} = \operatorname{Re}(s_{kl})$ and $q_{kl} = \operatorname{Im}(s_{kl})$ with $s_{kl}$ from (B.1). Stacking the apparent powers for all transmission lines yields

$$|S|^2(x) := \left( |s_{kl}(x)|^2 \right)_{(k,l) \in \mathcal{E}}.$$

Combining the above yields a (single-stage) *optimal power flow problem*

$$\min_{x} \; f(x) \tag{B.8a}$$

subject to

$$F(x) = 0, \tag{B.8b}$$

$$|S|^2(x) \le \bar{s}^2, \tag{B.8c}$$

$$\underline{x} \le x \le \bar{x}, \tag{B.8d}$$

where $\underline{x}$ and $\bar{x}$ collect lower/upper bounds on power injections and voltage magnitudes; $\bar{s}^2$ collects the upper limits on the apparent power over branches.

*Remark 22 (OPF variants)* Note that there exist many variants of OPF, differing for example in the underlying grid model (DC model vs. AC model, polar coordinates vs. rectangular coordinates, ...) [Zhu15; FR16; MH19], considering uncertain power injections and/or demands [FGM18; M+17; Müh+19; BCH14], or extensions considering component failure, which is called security-constrained OPF [MPG87; Cap+11]. Trying to give an exhaustive overview is beyond the scope of the present thesis and we refer for example to [FR16; FSR12; Zhu15; Cap+11; Cap16] for details. However, the majority of these problem formulations are extensions of the "standard OPF problem" presented here, and we see the formulation presented here as a basis for further studies.
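As an illustration of the structure of (B.8), the following sketch solves a tiny 2-bus instance with scipy's general-purpose SLSQP solver (all grid data and cost coefficients are invented; the line limit (B.8c) is omitted for brevity):

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative 2-bus OPF: generators at both buses, demand at bus 2.
y12 = 1.0 - 5.0j
Y = np.array([[y12, -y12], [-y12, y12]])
G, B = Y.real, Y.imag
pd2, qd2 = 0.3, 0.1                          # demand at bus 2

def net_injections(th, v):
    """Net injections p_k, q_k from (B.5a)/(B.5b)."""
    D = th[:, None] - th[None, :]
    p = v * ((G * np.cos(D) + B * np.sin(D)) @ v)
    q = v * ((G * np.sin(D) - B * np.cos(D)) @ v)
    return p, q

def power_balance(x):
    """(B.8b): generation minus demand must equal the net injections."""
    th2, v2, pg1, qg1, pg2, qg2 = x
    p, q = net_injections(np.array([0.0, th2]), np.array([1.0, v2]))
    return np.array([pg1 - p[0], qg1 - q[0],
                     pg2 - pd2 - p[1], qg2 - qd2 - q[1]])

# Quadratic generation costs (B.8a); generator 2 is deliberately expensive.
cost = lambda x: 0.5 * x[2] ** 2 + 0.2 * x[2] + 1.0 * x[4] ** 2 + 0.3 * x[4]
# Box constraints (B.8d) on (theta2, v2, pg1, qg1, pg2, qg2).
bounds = [(-0.5, 0.5), (0.9, 1.1), (0.0, 1.0), (-1.0, 1.0), (0.0, 1.0), (-1.0, 1.0)]
res = minimize(cost, x0=np.array([0.0, 1.0, 0.2, 0.0, 0.1, 0.1]),
               method="SLSQP", bounds=bounds,
               constraints={"type": "eq", "fun": power_balance})
```

A centralized solver like this serves as the reference against which the distributed reformulation below is compared.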

## **B.4 Distributed reformulation**

In order to apply distributed optimization algorithms, the above problems have to be in affinely-coupled separable form (2.12). In this section we describe one way of doing so.

In (2.12), any nonlinear equality constraint such as the AC power flow equations (B.5) has to be a *local constraint* collected either in $g_i$ or $h_i$, such that the coupling between the subsystems is solely affine and in the form of (2.12d). There

Figure B.3: Example grid with partitioning.

are multiple ways for achieving affine coupling: here we introduce auxiliary nodes in the middle of transmission lines interconnecting two subsystems and couple active/reactive power and the voltage phasor at these nodes. Other works such as [Ers15] use coupling in the voltage angles and magnitudes only. Another option is to introduce variable copies of the bus states connected to border lines and to couple the original states with their copies by means of (2.12d) without introducing any additional buses [HA09]. Our formulation has the advantage that no "internal information" from the interior of the subsystems has to be exchanged and that two neighboring subsystems share the cost for the line losses of the coupling line. However, other methods may benefit from a reformulated problem with a smaller dimension due to the absence of auxiliary buses.

Let us assume that a grid partitioning is given (for example by the areas of responsibility of the corresponding DSOs/TSOs or by the borders between different grid levels). This means that we have a set of subsystems $\mathcal{R} = \{1, \dots, R\}$, each equipped with a corresponding subset of buses $\mathcal{N}_i \subset \mathcal{N}$, where these subsets are disjoint, $\mathcal{N}_i \cap \mathcal{N}_j = \emptyset$ for all $i, j \in \mathcal{R}$ with $i \neq j$. Let us consider a concrete example: Figure B.3 shows a partitioning for the example 5-bus system from [LB10]. We decompose the node set $\mathcal{N} = \{1, \dots, 5\}$ into three regions $\mathcal{R} = \{1, 2, 3\}$ with $\mathcal{N}_1 = \{1, 5\}$, $\mathcal{N}_2 = \{2, 3\}$ and $\mathcal{N}_3 = \{4\}$, fulfilling the above assumptions.

#### **Splitting lines between subsystems**

Consider the π-branch model from Figure B.1, shown again in Figure B.4, where we assume that the line between node $k$ and node $l$ is a transmission line connecting two subsystems. We introduce two auxiliary buses $m$ and $n$ which are connected to the corresponding buses $k$ and $l$ in the subsystems—thus, we replace the original π-branch model by two π-branch models in series.<sup>4</sup> Regarding the line parameters, we double the series admittance $y_{kl}$ of the original connection and divide the shunt admittance $y_{kl}^s$ by two. With that we get auxiliary buses $m$ and $n$, which we collect in an auxiliary node set $\mathcal{N}^{\mathrm{a}}$, an auxiliary node pair $(m, n)$, which we collect in a set $\mathcal{A} \subseteq \mathcal{N}^{\mathrm{a}} \times \mathcal{N}^{\mathrm{a}}$, and new transmission lines $\mathcal{E}^{\mathrm{a}} \subset \mathcal{N} \times \mathcal{N}^{\mathrm{a}}$. To obtain equivalence to the original problem, we require

$$
u_m = u_n \quad \text{and} \quad s_{k,m} = -s_{n,l} \quad \text{for all} \ (m, n) \in \mathcal{A}. \tag{B.9}
$$

Taking the real part and imaginary part of (B.9) yields

$$v_m = v_n, \qquad\qquad p_{k,m} = -p_{n,l}, \tag{B.10a}$$

$$\theta_m = \theta_n, \qquad\qquad q_{k,m} = -q_{n,l}, \tag{B.10b}$$

for all $(m, n) \in \mathcal{A}$, resulting in an *affine coupling* in the form of (2.12d).

In order to simplify the presentation, we will assume from now on that the auxiliary buses are already included in the sets $\mathcal{N}_i$. More precisely, this means that we introduce extended bus sets $\tilde{\mathcal{N}}_i = \mathcal{N}_i \cup \{ m \in \mathcal{N}^{\mathrm{a}} \mid \text{there exists } (k, m) \in \mathcal{E}^{\mathrm{a}} \text{ with } k \in \mathcal{N}_i \}$. In the following we use the symbol $\mathcal{N}_i$ for the extended bus set $\tilde{\mathcal{N}}_i$ for simplicity.

<sup>4</sup> Note that this reformulation is equivalent to the original model only if $y_{kl}^s = 0$. However, from a physical perspective, a series connection of multiple π-elements can in general describe the physical behavior of the branch more accurately, as it allows for a better approximation of the hyperbolic equations of transmission lines [Sch17, Ch. 10.3].

Figure B.4: Decomposition of a π-line model.

#### **The reformulated problem**

With the above, we can construct the power flow equations for each region $i \in \mathcal{R}$ as

$$F_i(\chi_i) := (F_k(\chi_k))_{k \in \mathcal{N}_i},$$

where $\chi_i := (\chi_k)_{k \in \mathcal{N}_i}$. Similarly, we have for the line limits and box constraints

$$|S|_i^2(\chi_i) := \left( |s_{kl}(\chi_i)|^2 \right)_{(k,l) \in \mathcal{N}_i \times \mathcal{N}_i} \quad \text{and} \quad \underline{x}_i \le x_i \le \bar{x}_i$$

for all $i \in \mathcal{R}$. Furthermore, we define for all $i \in \mathcal{R}$

$$f_i(x_i) := \sum_{k \in \mathcal{N}_i} f_k^{\mathrm{g}}(p_k^{\mathrm{g}}).$$

This yields an OPF problem in form of (2.12)

$$\min_{x_1, \dots, x_R} \; \sum_{i \in \mathcal{R}} f_i(x_i) \tag{B.11a}$$

subject to

$$F_i(x_i) = 0 \qquad \text{for all} \quad i \in \mathcal{R}, \tag{B.11b}$$

$$|S|_i^2(x_i) \le \bar{s}_i^2 \qquad \text{for all} \quad i \in \mathcal{R}, \tag{B.11c}$$

$$\underline{x}_i \le x_i \le \bar{x}_i \qquad \text{for all} \quad i \in \mathcal{R}, \tag{B.11d}$$

$$\sum_{i \in \mathcal{R}} A_i x_i = 0, \tag{B.11e}$$

where the matrices $A_i$ are constructed such that (B.11e) encodes the coupling constraints (B.10).
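To illustrate how the coupling (B.10) is encoded in matrices $A_i$, consider a single auxiliary pair $(m, n)$ and assume (purely for illustration) that the last four entries of each region's variable vector are $(v, \theta, p, q)$ at its auxiliary bus. The $A_i$ are then signed selection matrices:

```python
import numpy as np

# Hypothetical local variable stacking: the last four entries of each x_i are
# (v, theta, p, q) at the region's auxiliary bus; leading entries are internal.
n1, n2 = 6, 5                              # local dimensions (illustrative)

def coupling_matrix(n, sign_v):
    """Four rows selecting (v, theta, p, q) at the auxiliary bus.

    Voltages/angles enter with opposite signs (v_m - v_n = 0), powers with
    equal signs (p_{k,m} + p_{n,l} = 0), cf. (B.10)."""
    A = np.zeros((4, n))
    A[0, n - 4] = sign_v                   # v
    A[1, n - 3] = sign_v                   # theta
    A[2, n - 2] = 1.0                      # p
    A[3, n - 1] = 1.0                      # q
    return A

A1, A2 = coupling_matrix(n1, +1.0), coupling_matrix(n2, -1.0)

# Consistent local solutions: equal phasors, opposite tie-line powers.
x1 = np.array([0.1, 0.2, 1.01, -0.02, 0.25, 0.05])
x2 = np.array([0.4, 1.01, -0.02, -0.25, -0.05])
residual = A1 @ x1 + A2 @ x2               # (B.11e) is satisfied: residual = 0
```

Note that no internal variables of either region appear in the coupling, matching the information-hiding property discussed above.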

# **C Insights on ADMM**

## **C.1 Modulus switching in ADMM**

As we have seen in Section 5.5, d-ADMM seems to be able to reach only a limited accuracy. In this section we investigate this behavior on a small-scale example. It turns out that the accuracy is not limited, but the convergence is so slow that it seems like ADMM is not making any progress. Moreover, we show that by choosing a different $\rho$, one can achieve a higher accuracy, but then the overall convergence to that accuracy is slow. So there seems to be a trade-off between the achievable accuracy and the convergence speed.

Consider an unconstrained strongly convex QP with objective function $f(x) = \frac{1}{2} x^\top H x - h^\top x$, where $H = \sum_{i \in \mathcal{R}} H_i \succ 0$ and $h = \sum_{i \in \mathcal{R}} h_i$, solved with ADMM similar to the inner problems in bi-level ALADIN from Section 4.2.1. Let

$$H\_1 = \begin{pmatrix} a & c & c \\ c & d & 0 \\ c & 0 & 0 \end{pmatrix}, \quad H\_2 = \begin{pmatrix} a & c+d & c \\ c+d & b & d \\ c & d & 0 \end{pmatrix}, \quad H\_3 = \begin{pmatrix} 0 & d & 0 \\ d & b & d \\ 0 & d & 0 \end{pmatrix},$$

and correspondingly chosen vectors $h_1$, $h_2$, $h_3$ with $a = 10^{-4}$, $b = 5$, $c = 10^{-6}$ and $d = 10^{-2}$. Then we have a condition number $\operatorname{cond}(H) = 6.25 \cdot 10^4$, which should not pose too severe difficulties.

Figure C.1 shows the resulting convergence plots for d-ADMM for different values of $\rho$. One can clearly see that the accuracy in the primal and dual variables seems limited for this problem to roughly $10^{-5}$ for $\rho = 10^{-4}$. By choosing a larger $\rho$ of $5 \cdot 10^{-3}$ or $10^{-2}$, one can achieve higher accuracies, but in this case one needs several thousand iterations to reach an accuracy of $10^{-5}$ in the primal variables. In practice, this is often not an option—and especially not in bi-level ALADIN, as then the overall communication would be very high. So in practice, tuning of $\rho$ is typically done in a way such that one

Figure C.1: ADMM for solving an unconstrained convex QP.

gets fast convergence in the primal variables in the beginning of the ADMM iterations. Hence, for the here-considered example one would probably choose a $\rho$ around $10^{-3}$, since one can then reach an accuracy of around $10^{-4}$ after several tens/hundreds of iterations. By doing so, one also essentially limits the accuracy of ADMM to that level. Note that for small $\rho = 10^{-5}$, ADMM diverges, as the individual $f_i$s are unbounded below since the matrices $H_i$ are indefinite.
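The qualitative trade-off between $\rho$ and convergence speed can be reproduced with a small consensus-ADMM sketch. The data below is illustrative (diagonal, convex but ill-conditioned local Hessians—not the $H_i$ from above, whose entries make some local problems non-convex):

```python
import numpy as np

# Illustrative ill-conditioned but convex local Hessians (not the H_i above).
H = [np.diag([1e-4, 1.0]), np.diag([1.0, 1e-4]), np.diag([1e-2, 1e-2])]
h = [Hi @ np.ones(2) for Hi in H]          # chosen such that x* = (1, 1)
x_star = np.linalg.solve(sum(H), sum(h))

def consensus_admm(rho, iters):
    """ADMM for min sum_i 1/2 x_i'H_i x_i - h_i'x_i  s.t.  x_i = z for all i."""
    z = np.zeros(2)
    lam = [np.zeros(2) for _ in H]
    for _ in range(iters):
        # local minimizations (x-update)
        x = [np.linalg.solve(Hi + rho * np.eye(2), hi - li + rho * z)
             for Hi, hi, li in zip(H, h, lam)]
        # averaging (z-update) and dual ascent
        z = np.mean([xi + li / rho for xi, li in zip(x, lam)], axis=0)
        lam = [li + rho * (xi - z) for li, xi in zip(lam, x)]
    return z

z_tuned = consensus_admm(rho=5e-2, iters=5000)   # rho matched to the spectrum
z_small = consensus_admm(rho=1e-4, iters=5000)   # too small: very slow progress
```

Even in this fully convex setting, a poorly chosen $\rho$ leaves a visible residual error after thousands of iterations, mirroring the behavior in Figure C.1.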

From our example one can observe another interesting effect: ADMM seems to switch its linear convergence modulus while iterating. Especially the case $\rho = 10^{-2}$ illustrates this behavior. A similar behavior is also observed in other works [NOB16; Gol+14]. To the best of our knowledge, this effect has not been investigated in the literature so far. One can also draw another conclusion from this example: although linear convergence has been shown for strongly convex problems (which our example is), e.g. in [GB16], the present example (and the examples from the inner problems in bi-level ALADIN (Section 5.5)) shows that the modulus of convergence for ADMM can be arbitrarily bad. Thus, one has to be very careful with the interpretation of such results—if the considered situation occurs, one essentially needs a very large number of iterations for convergence.

Conditions characterizing when such slow convergence occurs do, to the best of our knowledge, not exist in the literature. In our particular case the matrices $H_i$ are indefinite, leading to non-convex $f_i$s, which might be one reason. However, investigating this in more detail seems to be an important direction of future work.

## **C.2 Examples: ADMM getting stuck for** $\rho \to \infty$

We give an analytic example where one can observe that ADMM gets stuck for $\rho \to \infty$ and a feasible initial point, as predicted by Proposition 1. Moreover, we show that ADMM converges to a feasible but not optimal point in case of an infeasible initialization for the same example. In a second example, we show that evaluating primal feasibility only is in general not sufficient for convergence of ADMM, which is sometimes assumed in the power system context [Ers15; GHT17].

*Example 4 (ADMM getting stuck for* $\rho \to \infty$*)* Consider the problem $\min_{x,z} \frac{1}{2} x^2$ subject to $x = z$ and $z \ge -1$. The partial augmented Lagrangian with respect to the constraint $x = z$ is $L_\rho(x, z, \lambda) = \frac{1}{2} x^2 + \lambda(x - z) + \frac{\rho}{2}(x - z)^2$.

Case 1: Consider a feasible $z^0$, i.e. $z^0 \ge -1$. Applying ADMM (Algorithm 3), minimizing $L_\rho$ with respect to $x$ yields $x^{k+1} + \lambda^k + \rho(x^{k+1} - z^k) = 0$ and thus $x^{k+1} = \frac{\rho z^k - \lambda^k}{1 + \rho} = \frac{z^k - \lambda^k/\rho}{1/\rho + 1} \xrightarrow{\rho \to \infty} z^k$. Minimizing with respect to $z$ yields $-\lambda^k - \rho(x^{k+1} - z^{k+1}) = 0$ and thus $z^{k+1} = x^{k+1} + \frac{\lambda^k}{\rho} = \frac{z^k - \lambda^k/\rho}{1/\rho + 1} + \frac{\lambda^k}{\rho} \xrightarrow{\rho \to \infty} z^k$. The update in step 3) reads $\lambda^{k+1} = \lambda^k + \rho(x^{k+1} - z^{k+1}) = \lambda^k + \rho(x^{k+1} - x^{k+1} - \lambda^k/\rho) = 0$. This shows that, as predicted by Proposition 1, ADMM gets stuck at $z^0$ for this example. In case the initial point is infeasible, however, the situation is different, as we will see now.

Case 2: Consider an infeasible point $z^0 < -1$ and an arbitrary but fixed $\lambda^0$. Then, the minimizer of $L_\rho$ with respect to $z$, $z^{k+1}$, will be $-1$ for $\rho \to \infty$ since $z \ge -1$ has to be satisfied. After that iteration, we have, similar to the above,

$x^{k+1} = z^{k+1} = -1$ for $\rho \to \infty$ and thus $\lambda^{k+1} \to 0$. This shows exemplarily that the behavior of ADMM for large $\rho$ heavily depends on whether or not the initial point is feasible. Note that in case there are also individual constraints for $x$, the situation might be even more complicated, as some kind of "alternating projections" between the constraint sets might occur, leading to an oscillatory behavior. Moreover, note that a similar situation occurs if one increases $\rho$ to very large values as ADMM proceeds, as for example done in [Ers15; GHT17].
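The two parameter regimes of Example 4 can be checked numerically. The sketch below implements the three ADMM steps exactly as derived above (variable names are ours): with a moderate $\rho$ the iterates converge to the minimizer $x^\star = z^\star = 0$, while a huge $\rho$ leaves them stuck near the feasible start $z^0$:

```python
def admm(rho, z0, iters=50):
    """ADMM for min 1/2 x^2  s.t.  x = z, z >= -1, with feasible start z0."""
    z, lam = z0, 0.0
    for _ in range(iters):
        x = (rho * z - lam) / (1.0 + rho)   # argmin_x L_rho(x, z, lam)
        z = max(x + lam / rho, -1.0)        # argmin_z L_rho(x, z, lam) over z >= -1
        lam = lam + rho * (x - z)           # dual update, step 3)
    return z

z_moderate = admm(rho=1.0, z0=0.5)    # converges to the solution x* = z* = 0
z_huge = admm(rho=1e8, z0=0.5)        # "gets stuck" near the feasible start z0
```

With $\lambda^0 = 0$ the dual variable stays at zero, so each iteration merely shrinks $z$ by the factor $\rho/(1+\rho)$, which tends to $1$ as $\rho \to \infty$.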

*Example 5 (Primal feasibility is not sufficient for termination)* Consider Example 4 again, neglecting the constraint $z \ge -1$. Let us furthermore assume that we use ADMM with primal feasibility ($|x^{k+1} - z^{k+1}| < \epsilon$) as the only termination criterion. From Example 4 we know that $|x^{k+1} - z^{k+1}| = |\lambda^k/\rho|$. This shows that if we choose $\rho$ large enough such that $|\lambda^k/\rho| < \epsilon$, ADMM will immediately terminate even though $(x^{k+1}, z^{k+1})$ is not optimal.

# **D Distributed Optimization and Computer Science**

Distributed optimization and computer science are strongly connected. One reason for this is that distributed optimization has many promising applications in computer science beyond a multi-agent setting. One can identify applications in at least three branches:

- in image reconstruction [Yan+16; ZK14; XY13];
- and in support vector machines [FCG10; Boy+11].

In this thesis we present a particular SVM example in Chapter 2. An additional logistic regression example from machine learning is included in the example set of ALADIN-α, available under [Eng20a].

Also from a historical perspective, many pioneers of distributed optimization are closely connected to computer science. John von Neumann's alternating projection method [Neu49], for example, is closely related to ADMM; so is John Tsitsiklis's work on distributed and asynchronous algorithms [Tsi94; BT89]. These roots can also be observed in terminology such as nonlinear/integer/quadratic *programming*.

Moreover, interconnections between systems and control and optimization over networks/theoretical informatics exist. The Bellman-Ford algorithm, for example, was invented by one of the most famous people working in systems and control and a pioneer of so-called dynamic programming: Richard Bellman [Bel58]. Even Dijkstra's algorithm can be cast within the framework of dynamic programming, which is often considered to be a "control method" [Sni06]. Hence, distributed optimization can be seen as an interdisciplinary field connecting computer science, applied mathematics, and systems and control.

## **Distributed optimization vs. parallel computing**

In computer science, a distinction between *parallel computing* and *distributed computing* is often made. Parallel computing usually means that computation is done on multiple processors with one shared memory. Distributed computing typically refers to the case where computers are connected via a network with distributed memory and operate via message-passing [BT89; Don+03]. This work focuses mainly on distributed computing, although the developed algorithms can also be used for parallel computing.

Although distributed optimization techniques are interesting for parallel computation, the requirements are typically slightly different. Whereas in parallel computing partitioning is usually done for load balancing—i.e. the problem size handled by each processor is approximately the same for all processors—in our setting the partitioning is usually given a priori by the subsystem boundaries. Furthermore, in parallel computing, communication is usually less of a concern because of the fast and high-throughput shared memory.

## **Distributed optimization vs. optimization over networks**

The problems considered in this thesis are different from classical optimization problems in theoretical informatics defined over graphs, such as shortest-path problems:<sup>1</sup> here we consider continuous optimization problems without integer variables and with graphs only as a side-topic.

In contrast to network optimization, in our setting complexity is mostly "in the nodes" rather than "in the network", in the sense that in network flow problems each node typically has only one decision variable and the edges are more important. In

<sup>1</sup> Although certain interconnections may exist. The duality theory from Chapter 2 can for example be fruitfully used in the context of network flow problems [Roc98; Ber98].

this thesis, we consider problems where each node is a (complex) optimization problem on its own, for example itself being a dynamical system or an area of a power grid. Methods from network optimization are not straightforward to use in this setting. An additional difference to classical optimization over networks is that here the edges in general come with underlying physical laws. Although electrical grids can be seen as a graph with complex-valued weights [Lei+15], for the AC grid model the nodes are coupled via nonlinear equations, making classical algorithms from theoretical informatics hard to use.

# **References**




































# **Publications**

# **Journal Articles**


# **Conference Papers**



Mathematical optimization techniques are among the most successful tools for controlling such systems optimally with feasibility guarantees. Yet, they are often centralized – all data has to be collected in one central and computationally powerful entity. Methods from distributed optimization overcome this limitation. Classical approaches, however, are often not applicable due to non-convexities. The present work develops one of the first frameworks for distributed optimization with non-convex constraints.

